Preface


This book has grown out of printed notes which accompanied lectures 
given by ourselves and our colleagues over many years to undergraduate 
mathematicians at Oxford. During those years the contents and the 
arrangement of the lectures have changed substantially, and this book 
has a wider scope than is currently taught. It contains mathematics 
which, in an ideal world, would be part of the equipment of any well- 
educated mathematician. 

Numerical analysis is the branch of mathematics concerned with the 
theoretical foundations of numerical algorithms for the solution of prob- 
lems arising in scientific applications. The subject addresses a variety of 
questions ranging from the approximation of functions and integrals to 
the approximate solution of algebraic, transcendental, differential and 
integral equations, with particular emphasis on the stability, accuracy, 
efficiency and reliability of numerical algorithms. The purpose of this
book is to provide an elementary introduction to this active and exciting
field; it is aimed at students in the second year of a university
mathematics course.

The book addresses a wide range of numerical problems in algebra 
and analysis. Chapter 2 deals with the solution of systems of linear 
equations, a process which can be completed in a finite number of arith- 
metical operations. In the rest of the book the solution of a problem 
is sought as the limit of an infinite sequence; in that sense the output 
of the numerical algorithm is an ‘approximate’ solution. This need not, 
however, mean any relaxation of the usual standards of rigorous analysis.
The idea of convergence of a sequence of real numbers $(x_n)$ to a
real number $\xi$ is very familiar: given any positive value of $\varepsilon$ there exists
a positive integer $N_0$ such that $|x_n - \xi| < \varepsilon$ for all $n$ such that $n > N_0$.
In such a situation one can obtain as accurate an approximation to $\xi$ as
required by calculating sufficiently many members of the sequence, or
just one member, sufficiently far along. A ‘pure mathematician’ would
prefer the exact answer, $\xi$, but the sorts of guaranteed accurate
approximations which will be discussed here are entirely satisfactory in real-life
applications. 

Numerical analysis brings two new ideas to the usual discussion of 
convergence of sequences. First, we need, not just the existence of $N_0$,
but a good estimate of how large it is; and it may be too large for
practical calculations. Second, rather than being asked for the limit of
a given sequence, we are usually given the existence of the limit $\xi$ (or
its approximate location on the real line) and then have to construct a
sequence which converges to it. If the rate of convergence is slow, so
that the value of $N_0$ is large, we must then try to construct a better
sequence, one that converges to $\xi$ more rapidly. These ideas have direct
applications in the solution of a single nonlinear equation in Chapter 
1, the solution of systems of nonlinear equations in Chapter 4 and the 
calculation of the eigenvalues and eigenvectors of a matrix in Chapter 5. 

The next six chapters are concerned with polynomial approximation, 
and show how, in various ways, we can construct a polynomial which 
approximates, as accurately as required, a given continuous function. 
These ideas have an obvious application in the evaluation of integrals, 
where we calculate the integral of the approximating polynomial instead 
of the integral of the given function. 

Finally, Chapters 12 to 14 deal with the numerical solution of ordinary 
differential equations, with Chapter 14 presenting the fundamentals of 
the finite element method. The results of Chapter 14 can be readily 
extended to linear second-order partial differential equations. 

We have tried to make the coverage as complete as is consistent with 
remaining quite elementary. The limitations of size are most obvious 
in Chapter 12 on the solution of initial value problems for ordinary 
differential equations. This is an area where a number of excellent books 
are available, at least one of which is published in two weighty volumes. 
Chapter 12 does not describe or analyse anything approaching all the 
available methods, but we hope we have included some of those in most 
common use. 

There is a selection of Exercises at the end of each chapter. All these 
exercises are theoretical; students are urged to apply all the methods 
described to some simple examples to see what happens. A few of the 
exercises will be found to require some heavy algebraic manipulation; 
these have been included because we assume that readers will have
access to some computer algebra system such as Maple or Mathematica,
which then makes the algebraic work almost trivial. Those involved in
teaching courses based on this book may obtain copies of LaTeX files
containing solutions to these exercises by applying to the publisher by email
(solutions@cambridge.org). Although the material presented in this
book does not presuppose the reader’s acquaintance with mathematical 
software packages, the importance of these cannot be overemphasised. 
In Appendix B, a brief set of pointers is provided to relevant software 
repositories. 

Our treatment is intended to maintain a reasonably high standard of 
rigour, with many theorems and formal proofs. The main prerequisite 
is therefore some familiarity with elementary real analysis. Appendix A 
lists the standard theorems (labelled Theorem A.1, A.2, ..., A.7) which
are used in the book, together with proofs of one or two of them which 
might be less familiar. Some knowledge of basic matrix algebra is as- 
sumed. We have also used some elementary ideas from the theory of 
normed linear spaces in a number of places; complete definitions and ex- 
amples are given. Some prior knowledge of these areas would be helpful, 
although not essential. 

The chart below indicates how the chapters of the book are interrelated.
It shows, in particular, how Chapters 1 to 5 form a largely
self-contained unit, as do Chapters 6 to 10.


Roadmap of the book

[Figure: roadmap chart showing how Chapters 1 to 14 depend on one another.]

We have included some historical notes throughout the book. As well
as hoping to stimulate an interest in the development of the subject, 
these notes show how wide a historical range even this elementary book 
covers. Many of the methods were developed by the great mathemati- 
cians of the seventeenth and eighteenth centuries, including Newton, 
Euler and Gauss, but what is usually known as Gaussian elimination for 
the solution of systems of linear equations was known to the Chinese two 
thousand years ago. At the other end of the historical scale, the analy- 
sis of the eigenvalue problem, and the numerical solution of differential 
equations, are much more recent, and are due to mathematicians who 
are still very much alive. Many of our historical notes are based on the 
excellent biographical database at the history of mathematics website 


http://www-history.mcs.st-andrews.ac.uk/history/ 


We have tried to eradicate as many typographical errors from the text 
as possible; however, we are mindful that some may have escaped our 
attention. We plan to post any typos reported to us on 


http://web.comlab.ox.ac.uk/oucl/work/endre.suli/index.html 


We wish to express our gratitude to Professor Bill Morton for setting 
us off on this tour de force, to David Tranah at Cambridge University 
Press for encouraging us to persist with the project, and to the staff 
of the Press for not only improving the appearance of the book and 
eliminating a number of typographical errors, but also for correcting 
and improving some of our mathematics. We also wish to thank our 
colleagues at the Oxford University Computing Laboratory, particularly 
Nick Trefethen, Mike Giles and Andy Wathen, for keeping our spirits up, 
and Paul Houston at the Department of Mathematics and Computer
Science of the University of Leicester for his help with the final example 
in the book. 

Above all, we are grateful to our families for their patience, support 
and understanding: this book is dedicated to them. 


ES & DFM, Oxford, September 2002.
Solution of equations by iteration


1.1 Introduction 


Equations of various kinds arise in a range of physical applications and 
a substantial body of mathematical research is devoted to their study. 
Some equations are rather simple: in the early days of our mathematical 
education we all encountered the single linear equation $ax + b = 0$, where
$a$ and $b$ are real numbers and $a \neq 0$, whose solution is given by the
formula $x = -b/a$. Many equations, however, are nonlinear: a simple
example is $ax^2 + bx + c = 0$, involving a quadratic polynomial with real
coefficients $a$, $b$, $c$, and $a \neq 0$. The two solutions to this equation, labelled
$x_1$ and $x_2$, are found in terms of the coefficients of the polynomial from
the familiar formulae
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}. \tag{1.1} \]
It is less likely that you have seen the more intricate formulae for the 
solution of cubic and quartic polynomial equations due to the sixteenth 
century Italian mathematicians Niccolo Fontana Tartaglia (1499-1557) 
and Lodovico Ferrari (1522-1565), respectively, which were published 
by Girolamo Cardano (1501-1576) in 1545 in his Artis magnae sive de 
regulis algebraicis liber unus. In any case, if you have been led to believe
that similar expressions involving radicals (roots of sums of products of
coefficients) will supply the solution to any polynomial equation, then
you should brace yourself for a surprise: no such closed formula exists
for a general polynomial equation of degree $n$ when $n \geq 5$. It transpires
that for each $n \geq 5$ there exists a polynomial equation of degree $n$ with
integer coefficients which cannot be solved in terms of radicals;¹ such is,
for example, $x^5 - 4x - 2 = 0$.

Since there is no general formula for the solution of polynomial equa- 
tions, no general formula will exist for the solution of an arbitrary non- 
linear equation of the form f(x) = 0 where f is a continuous real-valued 
function. How can we then decide whether or not such an equation 
possesses a solution in the set of real numbers, and how can we find a 
solution? 

The present chapter is devoted to the study of these questions. Our 
goal is to develop simple numerical methods for the approximate solution 
of the equation f(x) = 0 where f is a real-valued function, defined and 
continuous on a bounded and closed interval of the real line. Methods 
of the kind discussed here are iterative in nature and produce sequences 
of real numbers which, in favourable circumstances, converge to the 
required solution. 


1.2 Simple iteration 


Suppose that f is a real-valued function, defined and continuous on a 
bounded closed interval [a,b] of the real line. It will be tacitly assumed 
throughout the chapter that a < b, so that the interval is nonempty. We 
wish to find a real number $\xi \in [a,b]$ such that $f(\xi) = 0$. If such $\xi$ exists,
it is called a solution to the equation $f(x) = 0$.

Even some relatively simple equations may fail to have a solution in 
the set of real numbers. Consider, for example, 


\[ f\colon x \mapsto x^2 + 1. \]


Clearly $f(x) = 0$ has no solution in any interval $[a,b]$ of the real line.
Indeed, according to (1.1), the quadratic polynomial $x^2 + 1$ has two roots:
$x_1 = \sqrt{-1} = i$ and $x_2 = -\sqrt{-1} = -i$. However, these belong to the set
of imaginary numbers and are therefore excluded by our definition of
solution which only admits real numbers. In order to avoid difficulties
of this kind, we begin by exploring the existence of solutions to the
equation $f(x) = 0$ in the set of real numbers. Our first result in this
direction is rather simple. 

1 This result was proved in 1824 by the Norwegian mathematician Niels Henrik Abel
(1802-1829), and was further refined in the work of Évariste Galois (1811-1832),
who clarified the circumstances in which a closed formula may exist for the solution
of a polynomial equation of degree $n$ in terms of radicals.

Theorem 1.1 Let $f$ be a real-valued function, defined and continuous
on a bounded closed interval $[a,b]$ of the real line. Assume, further, that
$f(a)f(b) \leq 0$; then, there exists $\xi$ in $[a,b]$ such that $f(\xi) = 0$.


Proof If $f(a) = 0$ or $f(b) = 0$, then $\xi = a$ or $\xi = b$, respectively, and the
proof is complete. Now, suppose that $f(a)f(b) \neq 0$. Then, $f(a)f(b) < 0$;
in other words, 0 belongs to the open interval whose endpoints are $f(a)$
and $f(b)$. By the Intermediate Value Theorem (Theorem A.1), there
exists $\xi$ in the open interval $(a,b)$ such that $f(\xi) = 0$.


To paraphrase Theorem 1.1, if a continuous function f has opposite 
signs at the endpoints of the interval [a,b], then the equation f(x) = 0 
has a solution in (a,b). The converse statement is, of course, false. 
Consider, for example, a continuous function defined on [a,b] which 
changes sign in the open interval (a,b) an even number of times, with 
$f(a)f(b) \neq 0$; then, $f(a)f(b) > 0$ even though $f(x) = 0$ has solutions
inside [a,b]. Of course, in the latter case, there exist an even number 
of subintervals of (a,b) at the endpoints of each of which f does have 
opposite signs. However, finding such subintervals may not always be 
easy. 

To illustrate this last point, consider the rather pathological function
\[ f\colon x \mapsto \frac{1}{2} - \frac{1}{1 + M|x - 1.05|}, \tag{1.2} \]
depicted in Figure 1.1 for $x$ in the closed interval $[0.8, 1.8]$ and $M = 200$.
The solutions $x_1 = 1.05 - (1/M)$ and $x_2 = 1.05 + (1/M)$ to the equation
$f(x) = 0$ are only a distance $2/M$ apart and, for large and positive $M$,
locating them computationally will be a challenging task.


Remark 1.1 If you have access to the mathematical software package
Maple, plot the function f by typing


plot(1/2-1/(1+200*abs(x-1.05)), x=0.8..1.8, y=-0.5..0.6);


at the Maple command line, and then repeat this experiment by choosing 
M = 2000, 20000, 200000, 2000000, and 20000000 in place of the num- 
ber 200. What do you observe? For the last two values of M, replot the 
function f for x in the subinterval [1.04999, 1.05001]. 
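
The same difficulty can be observed without plotting. The following is a minimal sketch in Python (rather than Maple), under the assumption that we simply sample the function (1.2) on a uniform grid and look for subintervals at whose endpoints the sign condition of Theorem 1.1 holds. The helper name `sign_change_brackets` and the grid sizes are illustrative choices, not part of the text.

```python
def f(x, M=200.0):
    """The function (1.2): f(x) = 1/2 - 1/(1 + M|x - 1.05|)."""
    return 0.5 - 1.0 / (1.0 + M * abs(x - 1.05))


def sign_change_brackets(func, a, b, n):
    """Split [a, b] into n equal subintervals and return those at whose
    endpoints func satisfies func(left) * func(right) <= 0 (Theorem 1.1)."""
    brackets = []
    h = (b - a) / n
    for i in range(n):
        left, right = a + i * h, a + (i + 1) * h
        if func(left) * func(right) <= 0.0:
            brackets.append((left, right))
    return brackets


if __name__ == "__main__":
    # Grid sizes 10 * 3^j are chosen so that x = 1.05 is never a grid point.
    for M in (200.0, 20000.0):
        for n in (10, 30, 90, 270, 21870):
            found = sign_change_brackets(lambda x: f(x, M), 0.8, 1.8, n)
            print(f"M = {M:g}, n = {n}: {len(found)} bracket(s)")
```

With M = 200 the coarse grids n = 10, 30 and 90 detect no sign change at all, because both solutions fall inside a single subinterval; only once the mesh size drops below 2/M do the two brackets appear. A larger M therefore demands a proportionally finer grid.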


An alternative sufficient condition for the existence of a solution to
the equation $f(x) = 0$ is arrived at by rewriting it in the equivalent
form $x - g(x) = 0$ where $g$ is a certain real-valued function, defined
and continuous on $[a,b]$; the choice of $g$ and its relationship with $f$ will
be clarified below through examples. Upon such a transformation the
problem of solving the equation $f(x) = 0$ is converted into one of finding
$\xi$ such that $\xi - g(\xi) = 0$.


Fig. 1.1. Graph of the function $f\colon x \mapsto \frac{1}{2} - \frac{1}{1 + 200|x - 1.05|}$ for $x \in [0.8, 1.8]$.


Theorem 1.2 (Brouwer’s Fixed Point Theorem) Suppose that $g$
is a real-valued function, defined and continuous on a bounded closed
interval $[a,b]$ of the real line, and let $g(x) \in [a,b]$ for all $x \in [a,b]$.
Then, there exists $\xi$ in $[a,b]$ such that $\xi = g(\xi)$; the real number $\xi$ is
called a fixed point of the function $g$.


Proof Let $f(x) = x - g(x)$. Then, $f(a) = a - g(a) \leq 0$ since $g(a) \in [a,b]$
and $f(b) = b - g(b) \geq 0$ since $g(b) \in [a,b]$. Consequently, $f(a)f(b) \leq 0$,
with $f$ defined and continuous on the closed interval $[a,b]$. By Theorem
1.1 there exists $\xi \in [a,b]$ such that $0 = f(\xi) = \xi - g(\xi)$.


Figure 1.2 depicts the graph of a function $x \mapsto g(x)$, defined and
continuous on a closed interval $[a,b]$ of the real line, such that $g(x)$
belongs to $[a,b]$ for all $x$ in $[a,b]$. The function $g$ has three fixed points
in the interval $[a,b]$: the $x$-coordinates of the three points of intersection
of the graph of $g$ with the straight line $y = x$.

Fig. 1.2. Graph of a function $g$, defined and continuous on the interval $[a,b]$,
which maps $[a,b]$ into itself; $g$ has three fixed points in $[a,b]$: the $x$-coordinates
of the three points of intersection of the graph of $g$ with $y = x$.


Of course, any equation of the form $f(x) = 0$ can be rewritten in the
equivalent form $x = g(x)$ by letting $g(x) = x + f(x)$. While there is no
guarantee that the function $g$, so defined, will satisfy the conditions of
Theorem 1.2, there are many alternative ways of transforming $f(x) = 0$
into $x = g(x)$, and we only have to find one such rearrangement with $g$
continuous on $[a,b]$ and such that $g(x) \in [a,b]$ for all $x \in [a,b]$. Sounds
simple? Fine. Take a look at the following example.


Example 1.1 Consider the function $f$ defined by $f(x) = e^x - 2x - 1$
for $x \in [1,2]$. Clearly, $f(1) < 0$ and $f(2) > 0$. Thus we deduce from
Theorem 1.1 the existence of $\xi$ in $[1,2]$ such that $f(\xi) = 0$.


In order to relate this example to Theorem 1.2, let us rewrite the equation
$f(x) = 0$ in the equivalent form $x - g(x) = 0$, where the function $g$ is
defined on the interval $[1,2]$ by $g(x) = \ln(2x + 1)$; here (and throughout
the book) $\ln$ means $\log_e$. As $g(1) \in [1,2]$, $g(2) \in [1,2]$ and $g$ is monotonic
increasing, it follows that $g(x) \in [1,2]$ for all $x \in [1,2]$, showing that $g$
satisfies the conditions of Theorem 1.2. Thus, again, we deduce the
existence of $\xi \in [1,2]$ such that $\xi - g(\xi) = 0$ or, equivalently, $f(\xi) = 0$.
We could have also rewritten our equation as $x = (e^x - 1)/2$. However,
the associated function $g\colon x \mapsto (e^x - 1)/2$ does not map the interval $[1,2]$
into itself, so Theorem 1.2 cannot then be applied. $\diamond$
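
The two rearrangements in Example 1.1 can be compared numerically. The sketch below is illustrative only: the names `g1`, `g2` and `maps_into_itself` are ours, and sampling on a grid gives numerical evidence rather than a proof that a function maps [1, 2] into itself.

```python
import math


def g1(x):
    """Rearrangement x = ln(2x + 1) of e^x - 2x - 1 = 0."""
    return math.log(2.0 * x + 1.0)


def g2(x):
    """Alternative rearrangement x = (e^x - 1)/2 of the same equation."""
    return (math.exp(x) - 1.0) / 2.0


def maps_into_itself(g, a, b, samples=1001):
    """Check on a uniform grid of [a, b] that g(x) stays in [a, b].
    A sampled check is evidence only, not a proof."""
    for i in range(samples):
        x = a + (b - a) * i / (samples - 1)
        if not (a <= g(x) <= b):
            return False
    return True


if __name__ == "__main__":
    print("g1(1), g1(2) =", g1(1.0), g1(2.0))
    print("g1 stays in [1, 2] on the grid:", maps_into_itself(g1, 1.0, 2.0))
    print("g2(1), g2(2) =", g2(1.0), g2(2.0))
    print("g2 stays in [1, 2] on the grid:", maps_into_itself(g2, 1.0, 2.0))
```

For the second rearrangement the check already fails at the endpoints, since (e - 1)/2 < 1 and (e^2 - 1)/2 > 2.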
Although the ability to verify the existence of a solution to the equation
$f(x) = 0$ is important, none of what has been said so far provides
a method for solving this equation. The following definition is a first
step in this direction: it will lead to the construction of an algorithm for
computing an approximation to the fixed point $\xi$ of the function $g$, and
will thereby supply an approximate solution to the equivalent equation
$f(x) = 0$.


Definition 1.1 Suppose that $g$ is a real-valued function, defined and
continuous on a bounded closed interval $[a,b]$ of the real line, and assume
that $g(x) \in [a,b]$ for all $x \in [a,b]$. Given that $x_0 \in [a,b]$, the recursion
defined by

\[ x_{k+1} = g(x_k), \qquad k = 0, 1, 2, \ldots, \tag{1.3} \]

is called a simple iteration; the numbers $x_k$, $k \geq 0$, are referred to as
iterates.


If the sequence $(x_k)$ defined by (1.3) converges, the limit must be a
fixed point of the function $g$, since $g$ is continuous on a closed interval.
Indeed, writing $\xi = \lim_{k\to\infty} x_k$, we have that

\[ \xi = \lim_{k\to\infty} x_{k+1} = \lim_{k\to\infty} g(x_k) = g\left(\lim_{k\to\infty} x_k\right) = g(\xi), \tag{1.4} \]

where the second equality follows from (1.3) and the third equality is a
consequence of the continuity of $g$.

A sufficient condition for the convergence of the sequence $(x_k)$ is provided
by our next result which represents a refinement of Brouwer’s
Fixed Point Theorem, under the additional assumption that the mapping
$g$ is a contraction.


Definition 1.2 (Contraction) Suppose that $g$ is a real-valued function,
defined and continuous on a bounded closed interval $[a,b]$ of the
real line. Then, $g$ is said to be a contraction on $[a,b]$ if there exists a
constant $L$ such that $0 < L < 1$ and

\[ |g(x) - g(y)| \leq L|x - y| \qquad \forall x, y \in [a,b]. \tag{1.5} \]


Remark 1.2 The terminology ‘contraction’ stems from the fact that
when (1.5) holds with $0 < L < 1$, the distance $|g(x) - g(y)|$ between the
images of the points $x$, $y$ is (at least $1/L$ times) smaller than the distance
$|x - y|$ between $x$ and $y$. More generally, when $L$ is any positive real
number, (1.5) is referred to as a Lipschitz condition.¹


Armed with Definition 1.2, we are now ready to state the main result 
of this section. 


Theorem 1.3 (Contraction Mapping Theorem) Let $g$ be a real-valued
function, defined and continuous on a bounded closed interval
$[a,b]$ of the real line, and assume that $g(x) \in [a,b]$ for all $x \in [a,b]$.
Suppose, further, that $g$ is a contraction on $[a,b]$. Then, $g$ has a unique
fixed point $\xi$ in the interval $[a,b]$. Moreover, the sequence $(x_k)$ defined
by (1.3) converges to $\xi$ as $k \to \infty$ for any starting value $x_0$ in $[a,b]$.


Proof The existence of a fixed point $\xi$ for $g$ is a consequence of Theorem
1.2. The uniqueness of this fixed point follows from (1.5) by contradiction:
for suppose that $g$ has a second fixed point, $\eta$, in $[a,b]$. Then,

\[ |\xi - \eta| = |g(\xi) - g(\eta)| \leq L|\xi - \eta|, \]

i.e., $(1 - L)|\xi - \eta| \leq 0$. As $1 - L > 0$, we deduce that $\eta = \xi$.

Let $x_0$ be any element of $[a,b]$ and consider the sequence $(x_k)$ defined
by (1.3). We shall prove that $(x_k)$ converges to the fixed point $\xi$.
According to (1.5) we have that

\[ |x_k - \xi| = |g(x_{k-1}) - g(\xi)| \leq L|x_{k-1} - \xi|, \qquad k \geq 1, \]

from which we then deduce by induction that

\[ |x_k - \xi| \leq L^k |x_0 - \xi|, \qquad k \geq 1. \tag{1.6} \]

As $L \in (0,1)$, it follows that $\lim_{k\to\infty} L^k = 0$, and hence we conclude
that $\lim_{k\to\infty} |x_k - \xi| = 0$.


Let us illustrate the Contraction Mapping Theorem by an example. 


Example 1.2 Consider the equation $f(x) = 0$ on the interval $[1,2]$ with
$f(x) = e^x - 2x - 1$, as in Example 1.1. Recall from Example 1.1 that this
equation has a solution, $\xi$, in the interval $[1,2]$, and $\xi$ is a fixed point of
the function $g$ defined on $[1,2]$ by $g(x) = \ln(2x + 1)$.


1 Rudolf Otto Sigismund Lipschitz (14 May 1832, Königsberg, Prussia (now Kaliningrad,
Russia) – 7 October 1903, Bonn, Germany) made important contributions
to number theory, the theory of Bessel functions and Fourier series, the theory 
of ordinary and partial differential equations, and to analytical mechanics and 
potential theory. 

Table 1.1. The sequence $(x_k)$ defined by (1.8).

 k    x_k
 0    1.000000
 1    1.098612
 2    1.162283
 3    1.201339
 4    1.224563
 5    1.238121
 6    1.245952
 7    1.250447
 8    1.253018
 9    1.254486
10    1.255323
11    1.255800

Now, the function $g$ is defined and continuous on the interval $[1,2]$, and $g$
is differentiable on $(1,2)$. Thus, by the Mean Value Theorem (Theorem
A.3), for any $x$, $y$ in $[1,2]$ we have that

\[ |g(x) - g(y)| = |g'(\eta)(x - y)| = |g'(\eta)|\,|x - y| \tag{1.7} \]

for some $\eta$ that lies between $x$ and $y$ and is therefore in the interval
$[1,2]$. Further, $g'(x) = 2/(2x + 1)$ and $g''(x) = -4/(2x + 1)^2$. As
$g''(x) < 0$ for all $x$ in $[1,2]$, $g'$ is monotonic decreasing on $[1,2]$. Hence
$g'(1) \geq g'(\eta) \geq g'(2)$, i.e., $g'(\eta) \in [2/5, 2/3]$. Thus we deduce from (1.7)
that

\[ |g(x) - g(y)| \leq L|x - y| \qquad \forall x, y \in [1,2], \]

with $L = 2/3$. According to the Contraction Mapping Theorem, the
sequence $(x_k)$ defined by the simple iteration

\[ x_{k+1} = \ln(2x_k + 1), \qquad k = 0, 1, 2, \ldots, \tag{1.8} \]

converges to $\xi$ for any starting value $x_0$ in $[1,2]$. Let us choose $x_0 = 1$, for
example, and compute the next 11 iterates, say. The results are shown
in Table 1.1. Even though we have carried six decimal digits, after 11
iterations only the first two decimal digits of the iterates $x_k$ appear to
have settled; thus it seems likely that $\xi = 1.26$ to two decimal digits. $\diamond$
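
The iterates of Table 1.1 are easy to regenerate. The following is a minimal Python sketch of the simple iteration (1.3) applied to (1.8); the function name `simple_iteration` is an illustrative choice, and double-precision arithmetic is assumed.

```python
import math


def simple_iteration(g, x0, n):
    """Return the list [x_0, x_1, ..., x_n] with x_{k+1} = g(x_k), as in (1.3)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs


def g(x):
    """The function g(x) = ln(2x + 1) of Example 1.2."""
    return math.log(2.0 * x + 1.0)


if __name__ == "__main__":
    # Print k and x_k to six decimal digits, in the layout of Table 1.1.
    for k, xk in enumerate(simple_iteration(g, 1.0, 11)):
        print(f"{k:2d}    {xk:.6f}")
```

Raising the third argument beyond 11 shows how slowly the later digits settle, which motivates the analysis that follows.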


You may now wonder how many iterations we should perform in (1.8) 
to ensure that all six decimals have converged to their correct values. In
order to answer this question, we need to carry out some analysis. 


Theorem 1.4 Consider the simple iteration (1.3) where the function
$g$ satisfies the hypotheses of the Contraction Mapping Theorem on the
bounded closed interval $[a,b]$. Given $x_0 \in [a,b]$ and a certain tolerance
$\varepsilon > 0$, let $k_0(\varepsilon)$ denote the smallest positive integer such that $x_k$ is no
more than $\varepsilon$ away from the (unknown) fixed point $\xi$, i.e., $|x_k - \xi| \leq \varepsilon$,
for all $k \geq k_0(\varepsilon)$. Then,

\[ k_0(\varepsilon) \leq \left\lfloor \frac{\ln|x_1 - x_0| - \ln(\varepsilon(1 - L))}{\ln(1/L)} \right\rfloor + 1, \tag{1.9} \]

where, for a real number $x$, $\lfloor x \rfloor$ signifies the largest integer less than or
equal to $x$.

Proof From (1.6) in the proof of Theorem 1.3 we know that
\[ |x_k - \xi| \leq L^k |x_0 - \xi|, \qquad k \geq 1. \]
Using this result with $k = 1$, we obtain
\[ |x_0 - \xi| = |x_0 - x_1 + x_1 - \xi| \leq |x_0 - x_1| + |x_1 - \xi| \leq |x_0 - x_1| + L|x_0 - \xi|. \]
Hence
\[ |x_0 - \xi| \leq \frac{1}{1 - L}\,|x_0 - x_1|. \]
By substituting this into (1.6) we get
\[ |x_k - \xi| \leq \frac{L^k}{1 - L}\,|x_1 - x_0|. \tag{1.10} \]
Thus, in particular, $|x_k - \xi| \leq \varepsilon$ provided that
\[ L^k\,\frac{1}{1 - L}\,|x_1 - x_0| \leq \varepsilon. \]
On taking the (natural) logarithm of each side in the last inequality, we
find that $|x_k - \xi| \leq \varepsilon$ for all $k$ such that
\[ k \geq \frac{\ln|x_1 - x_0| - \ln(\varepsilon(1 - L))}{\ln(1/L)}. \]
Therefore, the smallest integer $k_0(\varepsilon)$ such that $|x_k - \xi| \leq \varepsilon$ for all
$k \geq k_0(\varepsilon)$ cannot exceed the expression on the right-hand side of the
inequality (1.9).


This result provides an upper bound on the maximum number of
iterations required to ensure that the error between the $k$th iterate $x_k$
and the (unknown) fixed point $\xi$ is below the prescribed tolerance $\varepsilon$.
Note, in particular, from (1.9), that if $L$ is close to 1, then $k_0(\varepsilon)$ may
be quite large for any fixed $\varepsilon$. We shall revisit this point later on in the
chapter.


Example 1.3 Now we can return to Example 1.2 to answer the question
posed there about the maximum number of iterations required, with
starting value $x_0 = 1$, to ensure that the last iterate computed is correct
to six decimal digits.


Letting $\varepsilon = 0.5 \times 10^{-6}$ and recalling from Example 1.2 that $L = 2/3$, the
formula (1.9) yields $k_0(\varepsilon) \leq \lfloor 32.778918 \rfloor + 1$, so we have that $k_0(\varepsilon) \leq 33$.
In fact, 33 is a somewhat pessimistic overestimate of the number of
iterations required: computing the iterates $x_k$ successively shows that
already $x_{25}$ is correct to six decimal digits, giving $\xi = 1.256431$. $\diamond$
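
The estimate in Example 1.3 can be checked directly. The sketch below evaluates the right-hand side of (1.9) and compares it with the number of iterations actually needed; the function names are illustrative, and the reference value of the fixed point is an assumption obtained by simply iterating 200 times in double precision.

```python
import math


def g(x):
    """The function g(x) = ln(2x + 1) of Example 1.2."""
    return math.log(2.0 * x + 1.0)


def iteration_bound(x0, x1, L, eps):
    """Right-hand side of (1.9)."""
    q = (math.log(abs(x1 - x0)) - math.log(eps * (1.0 - L))) / math.log(1.0 / L)
    return math.floor(q) + 1


def reference_fixed_point(g, x0, n=200):
    """Approximate the fixed point by n steps of simple iteration."""
    x = x0
    for _ in range(n):
        x = g(x)
    return x


def first_index_within(g, x0, xi, eps):
    """Smallest k with |x_k - xi| <= eps, found by direct computation."""
    x, k = x0, 0
    while abs(x - xi) > eps:
        x, k = g(x), k + 1
    return k


if __name__ == "__main__":
    x0, L, eps = 1.0, 2.0 / 3.0, 0.5e-6
    xi = reference_fixed_point(g, x0)
    print("bound from (1.9):      ", iteration_bound(x0, g(x0), L, eps))
    print("iterations really used:", first_index_within(g, x0, xi, eps))
    print("fixed point:            %.6f" % xi)
```

The gap between the two counts reflects the fact that L = 2/3 bounds the derivative over the whole of [1, 2], whereas near the fixed point the derivative of g is smaller.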


Condition (1.5) can be rewritten in the following equivalent form:

\[ \left| \frac{g(x) - g(y)}{x - y} \right| \leq L \qquad \forall x, y \in [a,b], \quad x \neq y, \]

with $L \in (0,1)$, which can, in turn, be rephrased by saying that the
absolute value of the slope of the function $g$ does not exceed $L \in (0,1)$.


Assuming that $g$ is a differentiable function on the open interval $(a,b)$,
the Mean Value Theorem (Theorem A.3) tells us that

\[ \frac{g(x) - g(y)}{x - y} = g'(\eta) \]

for some $\eta$ that lies between $x$ and $y$ and is therefore contained in the
interval $(a,b)$.
We shall therefore adopt the following assumption that is somewhat 
stronger than (1.5) but is easier to verify in practice: 


\[ g \text{ is differentiable on } (a,b) \text{ and } \exists L \in (0,1) \text{ such that } |g'(x)| \leq L \text{ for all } x \in (a,b). \tag{1.11} \]


Consequently, Theorem 1.3 still holds when (1.5) is replaced by (1.11). 
We note that the requirement in (1.11) that g be differentiable is 
indeed more demanding than the Lipschitz condition (1.5): for example,
$g(x) = |x|$ satisfies the Lipschitz condition on any closed interval of the
real line, with $L = 1$, yet $g$ is not differentiable at $x = 0$.¹

Next we discuss a local version of the Contraction Mapping Theorem,
where (1.11) is only assumed in a neighbourhood of the fixed point $\xi$
rather than over the entire interval $[a,b]$.


Theorem 1.5 Suppose that $g$ is a real-valued function, defined and
continuous on a bounded closed interval $[a,b]$ of the real line, and assume
that $g(x) \in [a,b]$ for all $x \in [a,b]$. Let $\xi = g(\xi) \in [a,b]$ be a fixed point
of $g$ (whose existence is ensured by Theorem 1.2), and assume that $g$
has a continuous derivative in some neighbourhood of $\xi$ with $|g'(\xi)| < 1$.
Then, the sequence $(x_k)$ defined by $x_{k+1} = g(x_k)$, $k \geq 0$, converges to $\xi$
as $k \to \infty$, provided that $x_0$ is sufficiently close to $\xi$.


Proof By hypothesis, there exists $h > 0$ such that $g'$ is continuous in
the interval $[\xi - h, \xi + h]$. Since $|g'(\xi)| < 1$ we can find a smaller interval
$I_\delta = [\xi - \delta, \xi + \delta]$, where $0 < \delta \leq h$, such that $|g'(x)| \leq L$ in this interval,
with $L < 1$. To do so, take $L = \frac{1}{2}(1 + |g'(\xi)|)$ and then choose $\delta \leq h$
such that

\[ |g'(x) - g'(\xi)| \leq \tfrac{1}{2}(1 - |g'(\xi)|) \]

for all $x$ in $I_\delta$; this is possible since $g'$ is continuous at $\xi$. Hence,

\[ |g'(x)| \leq |g'(x) - g'(\xi)| + |g'(\xi)| \leq \tfrac{1}{2}(1 - |g'(\xi)|) + |g'(\xi)| = L \]

for all $x \in I_\delta$. Now, suppose that $x_k$ lies in the interval $I_\delta$. Then,

\[ x_{k+1} - \xi = g(x_k) - \xi = g(x_k) - g(\xi) = (x_k - \xi)\,g'(\eta_k) \]

by the Mean Value Theorem (Theorem A.3), where $\eta_k$ lies between $x_k$
and $\xi$, and therefore also belongs to $I_\delta$. Hence $|g'(\eta_k)| \leq L$, and

\[ |x_{k+1} - \xi| \leq L|x_k - \xi|. \tag{1.12} \]

This shows that $x_{k+1}$ also lies in $I_\delta$, and a simple argument by induction
shows that if $x_0$ belongs to $I_\delta$, then all $x_k$, $k \geq 0$, are in $I_\delta$, and also

\[ |x_k - \xi| \leq L^k |x_0 - \xi|, \qquad k \geq 0. \tag{1.13} \]

Since $0 < L < 1$ this implies that the sequence $(x_k)$ converges to $\xi$. $\Box$

1 If you are familiar with the concept of Lebesgue measure, you will find the following
result, known as Rademacher’s Theorem, revealing. A function $f$ satisfying
the Lipschitz condition (1.5) on an interval $[a,b]$ is differentiable on $[a,b]$, except,
perhaps, at the points of a subset of zero Lebesgue measure.

If the conditions of Theorem 1.5 are satisfied in the vicinity of a fixed
point $\xi$, then the sequence $(x_k)$ defined by the iteration $x_{k+1} = g(x_k)$,
$k \geq 0$, will converge to $\xi$ for any starting value $x_0$ that is sufficiently
close to $\xi$. If, on the other hand, the conditions of Theorem 1.5 are
violated, there is no guarantee that any sequence $(x_k)$ defined by the
iteration $x_{k+1} = g(x_k)$, $k \geq 0$, will converge to the fixed point $\xi$ for
any starting value $x_0$ near $\xi$. In order to distinguish between these two
cases, we introduce the following definition.


Definition 1.3 Suppose that $g$ is a real-valued function, defined and
continuous on the bounded closed interval $[a,b]$, such that $g(x) \in [a,b]$
for all $x \in [a,b]$, and let $\xi$ denote a fixed point of $g$. We say that $\xi$ is
a stable fixed point of $g$, if the sequence $(x_k)$ defined by the iteration
$x_{k+1} = g(x_k)$, $k \geq 0$, converges to $\xi$ whenever the starting value $x_0$ is
sufficiently close to $\xi$. Conversely, if no sequence $(x_k)$ defined by this
iteration converges to $\xi$ for any starting value $x_0$ close to $\xi$, except for
$x_0 = \xi$, then we say that $\xi$ is an unstable fixed point of $g$.


We note that, with this definition, a fixed point may be neither stable 
nor unstable (see Exercise 2). 

As will be demonstrated below in Example 1.5, even some very simple 
functions may possess both stable and unstable fixed points. Theorem 
1.5 shows that if $g'$ is continuous in a neighbourhood of $\xi$, then the
condition $|g'(\xi)| < 1$ is sufficient to ensure that $\xi$ is a stable fixed point.
The case of an unstable fixed point will be considered later, in Theorem 
1.6. 

Now, assuming that $\xi$ is a stable fixed point of $g$, we may also be interested
in the speed at which the sequence $(x_k)$ defined by the iteration
$x_{k+1} = g(x_k)$, $k \geq 0$, converges to $\xi$. Under the hypotheses of Theorem
1.5, it follows from the proof of that theorem that

\[ \lim_{k\to\infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|} = \lim_{k\to\infty} \left| \frac{g(x_k) - g(\xi)}{x_k - \xi} \right| = |g'(\xi)|. \tag{1.14} \]

Consequently, we can regard $|g'(\xi)| \in (0,1)$ as a measure of the speed of
convergence of the sequence $(x_k)$ to the fixed point $\xi$.
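
The limit (1.14) can be observed numerically for the iteration of Example 1.2, where $g'(x) = 2/(2x + 1)$. The sketch below is illustrative: the function names are ours, and the reference fixed point is assumed to be adequately approximated by 200 steps of the iteration in double precision.

```python
import math


def g(x):
    """The function g(x) = ln(2x + 1) of Example 1.2."""
    return math.log(2.0 * x + 1.0)


def reference_fixed_point(g, x0, n=200):
    """Approximate the fixed point by n steps of simple iteration."""
    x = x0
    for _ in range(n):
        x = g(x)
    return x


def error_ratios(g, x0, xi, n):
    """Return |x_{k+1} - xi| / |x_k - xi| for k = 0, ..., n - 1."""
    ratios = []
    x = x0
    for _ in range(n):
        x_next = g(x)
        ratios.append(abs(x_next - xi) / abs(x - xi))
        x = x_next
    return ratios


if __name__ == "__main__":
    xi = reference_fixed_point(g, 1.0)
    for k, r in enumerate(error_ratios(g, 1.0, xi, 15)):
        print(f"k = {k:2d}   ratio = {r:.6f}")
    print("g'(xi) =", 2.0 / (2.0 * xi + 1.0))
```

Only a moderate number of ratios is printed: once the error approaches rounding level, the quotient of two tiny differences is no longer meaningful in floating-point arithmetic.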


Definition 1.4 Suppose that $\xi = \lim_{k\to\infty} x_k$. We say that the sequence
$(x_k)$ converges to $\xi$ at least linearly if there exist a sequence $(\varepsilon_k)$ of
positive real numbers converging to 0, and $\mu \in (0,1)$, such that

\[ |x_k - \xi| \leq \varepsilon_k, \quad k = 0, 1, 2, \ldots, \qquad \text{and} \qquad \lim_{k\to\infty} \frac{\varepsilon_{k+1}}{\varepsilon_k} = \mu. \tag{1.15} \]

If (1.15) holds with $\mu = 0$, then the sequence $(x_k)$ is said to converge to
$\xi$ superlinearly.

If (1.15) holds with $\mu \in (0,1)$ and $\varepsilon_k = |x_k - \xi|$, $k = 0, 1, 2, \ldots$, then
$(x_k)$ is said to converge to $\xi$ linearly, and the number $\rho = -\log_{10} \mu$ is
then called the asymptotic rate of convergence of the sequence. If
(1.15) holds with $\mu = 1$ and $\varepsilon_k = |x_k - \xi|$, $k = 0, 1, 2, \ldots$, the rate of
convergence is slower than linear and we say that the sequence converges
to $\xi$ sublinearly.


The words ‘at least’ in this definition refer to the fact that we only
have inequality in $|x_k - \xi| \le \varepsilon_k$, which may be all that can be ascertained
in practice. Thus, it is really the sequence of bounds $\varepsilon_k$ that converges
linearly.

For a linearly convergent sequence the asymptotic rate of convergence
$\rho$ measures the number of correct decimal digits gained in one iteration;
in particular, the number of iterations required in order to gain one more
correct decimal digit is at most $[1/\rho] + 1$. Here $[1/\rho]$ denotes the largest
integer that is less than or equal to $1/\rho$.

Under the hypotheses of Theorem 1.5, the equalities (1.14) will hold
with $\mu = |g'(\xi)| \in [0,1)$, and therefore the sequence $(x_k)$ generated
by the simple iteration will converge to the fixed point $\xi$ linearly or
superlinearly.


Example 1.4 Given that $\alpha$ is a fixed positive real number, consider the
function $g$ defined on the interval $[0,1]$ by

$$g(x) = \begin{cases} 2^{-\{1 + (\log_2(1/x))^{1/\alpha}\}^{\alpha}} & \text{for } 0 < x \le 1, \\ 0 & \text{for } x = 0. \end{cases}$$

As $\lim_{x\to 0+} g(x) = 0$, the function $g$ is continuous on $[0,1]$. Moreover, $g$
is strictly monotonic increasing on $[0,1]$ and $g(x) \in [0,1/2] \subset [0,1]$ for
all $x$ in $[0,1]$. We note that $\xi = 0$ is a fixed point of $g$ (cf. Figure 1.3).
Consider the sequence $(x_k)$ defined by $x_{k+1} = g(x_k)$, $k \ge 0$, with
$x_0 = 1$. It is a simple matter to show by induction that $x_k = 2^{-k^{\alpha}}$,
$k \ge 0$. Thus we deduce that $(x_k)$ converges to $\xi = 0$ as $k \to \infty$. Since

$$\lim_{k\to\infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|} = \mu = \begin{cases} 1 & \text{for } 0 < \alpha < 1, \\ \frac{1}{2} & \text{for } \alpha = 1, \\ 0 & \text{for } \alpha > 1, \end{cases}$$

we conclude that for $\alpha \in (0,1)$ the sequence $(x_k)$ converges to $\xi = 0$
sublinearly. For $\alpha = 1$ it converges to $\xi = 0$ linearly with asymptotic rate
$\rho = -\log_{10} \mu = \log_{10} 2$. When $\alpha > 1$, the sequence converges to the fixed
point $\xi = 0$ superlinearly. The same conclusions could have been reached
by showing (through tedious differentiation) that $\lim_{x\to 0+} g'(x) = \mu$,
with $\mu$ as defined above for the various values of the parameter $\alpha$. $\diamond$

Fig. 1.3. Graph of the function $g$ from Example 1.4 on the interval $x \in [0,1]$
for (a) $\alpha = 1/2$, (b) $\alpha = 1$, (c) $\alpha = 2$.
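The three regimes of Example 1.4 can be observed numerically. The following Python sketch is only an illustration: the names `g` and `error_ratios` are ours, not part of the text. It iterates $g$ from $x_0 = 1$ and records the ratios $|x_{k+1} - \xi| / |x_k - \xi|$ with $\xi = 0$.

```python
import math


def g(x, alpha):
    """The function g of Example 1.4, with g(0) = 0."""
    if x == 0.0:
        return 0.0
    return 2.0 ** (-((1.0 + math.log2(1.0 / x) ** (1.0 / alpha)) ** alpha))


def error_ratios(alpha, steps):
    """Return the ratios x_{k+1} / x_k for k = 0, ..., steps - 1 (the limit is 0)."""
    x = 1.0  # starting value x_0 = 1, as in the example
    ratios = []
    for _ in range(steps):
        x_next = g(x, alpha)
        ratios.append(x_next / x)
        x = x_next
    return ratios


for alpha in (0.5, 1.0, 2.0):
    # Last ratio after 10 steps: near 1, equal to 1/2, and near 0 respectively.
    print(alpha, error_ratios(alpha, 10)[-1])
```

For $\alpha = 1/2$ the ratios creep up towards 1, for $\alpha = 1$ they stay at $1/2$, and for $\alpha = 2$ they fall rapidly towards 0, in line with the values of $\mu$ in the example.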


For a linearly convergent simple iteration $x_{k+1} = g(x_k)$, where $g'$ is
continuous in a neighbourhood of the fixed point $\xi$ and $0 < |g'(\xi)| < 1$,
Definition 1.4 and (1.14) imply that the asymptotic rate of convergence
of the sequence $(x_k)$ is $\rho = -\log_{10} |g'(\xi)|$. Evidently, a small value of
$|g'(\xi)|$ corresponds to a large positive value of $\rho$ and will result in more
rapid convergence, while if $|g'(\xi)| < 1$ but $|g'(\xi)|$ is very close to 1, $\rho$ will
be a small positive number and the sequence will converge very slowly.¹

Next, we discuss the behaviour of the iteration (1.3) in the vicinity of
an unstable fixed point $\xi$. If $|g'(\xi)| > 1$, then the sequence $(x_k)$ defined
by (1.3) does not converge to $\xi$ from any starting value $x_0 \ne \xi$; the next
theorem gives a rigorous proof of this fact.


Theorem 1.6 Suppose that $\xi = g(\xi)$, where the function $g$ has a
continuous derivative in some neighbourhood of $\xi$, and let $|g'(\xi)| > 1$. Then,
the sequence $(x_k)$ defined by $x_{k+1} = g(x_k)$, $k \ge 0$, does not converge to
$\xi$ from any starting value $x_0$, $x_0 \ne \xi$.


Proof Suppose that $x_0 \ne \xi$. As in the proof of Theorem 1.5, we can see
that there is an interval $I_\delta = [\xi - \delta, \xi + \delta]$, $\delta > 0$, in which $|g'(x)| \ge L > 1$
for some constant $L$. If $x_k$ lies in this interval, then

$$|x_{k+1} - \xi| = |g(x_k) - g(\xi)| = |(x_k - \xi)\, g'(\eta_k)| \ge L |x_k - \xi|,$$

for some $\eta_k$ between $x_k$ and $\xi$. If $x_{k+1}$ lies in $I_\delta$ the same argument
shows that

$$|x_{k+2} - \xi| \ge L |x_{k+1} - \xi| \ge L^2 |x_k - \xi|,$$

and so on. Evidently, after a finite number of steps some member of
the sequence $x_{k+1}, x_{k+2}, x_{k+3}, \ldots$ must be outside the interval $I_\delta$, since
$L > 1$. Hence there can be no value of $k_0 = k_0(\delta)$ such that $|x_k - \xi| \le \delta$
for all $k \ge k_0$, and the sequence therefore does not converge to $\xi$.


Example 1.5 In this example we explore the simple iteration (1.3) for
$g$ defined by

$$g(x) = \tfrac{1}{2}(x^2 + c),$$

where $c \in \mathbb{R}$ is a fixed constant.

The fixed points of the function $g$ are the solutions of the quadratic
equation $x^2 - 2x + c = 0$, which are $1 \pm \sqrt{1 - c}$. If $c > 1$ there are no
solutions (in the set $\mathbb{R}$ of real numbers, that is!), if $c = 1$ there is one
solution in $\mathbb{R}$, and if $c < 1$ there are two.


¹ Thus $0 < \rho \ll 1$ corresponds to slow linear convergence and $\rho \gg 1$ to fast linear
convergence. It is for this reason that we defined the asymptotic rate of
convergence $\rho$, for a linearly convergent sequence, as $-\log_{10} \mu$ (or $-\log_{10} |g'(\xi)|$) rather
than $\mu$ (or $|g'(\xi)|$).
Suppose now that $c < 1$; we denote the solutions by $\xi_1 = 1 - \sqrt{1 - c}$
and $\xi_2 = 1 + \sqrt{1 - c}$, so that $\xi_1 < 1 < \xi_2$. We see at once that $g'(x) = x$,
so the fixed point $\xi_2$ is unstable, but that the fixed point $\xi_1$ is stable
provided that $-3 < c < 1$. In fact, it is easy to see that the sequence
$(x_k)$ defined by the iteration $x_{k+1} = g(x_k)$, $k \ge 0$, will converge to $\xi_1$
if the starting value $x_0$ satisfies $-\xi_2 < x_0 < \xi_2$. (See Exercise 1.) If
$c$ is close to 1, $g'(\xi_1)$ will also be close to 1 and convergence will be
slow. When $c = 0$, $\xi_1 = 0$ and hence $g'(\xi_1) = 0$, so that convergence is
superlinear. This is an example of quadratic convergence which we shall
meet later. $\diamond$
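The behaviour described in Example 1.5 is easy to reproduce. The sketch below is illustrative only: the helper names `simple_iteration` and `make_g`, the starting value 0.5 and the sample values of $c$ are our own choices. It applies the iteration for three values of $c$ and reports the distance to $\xi_1 = 1 - \sqrt{1 - c}$ after 30 steps.

```python
import math


def simple_iteration(g, x0, steps):
    """Apply x_{k+1} = g(x_k) `steps` times and return the final iterate."""
    x = x0
    for _ in range(steps):
        x = g(x)
    return x


def make_g(c):
    """Return g(x) = (x^2 + c)/2 for a fixed constant c."""
    return lambda x: 0.5 * (x * x + c)


for c in (0.5, 0.99, 0.0):
    xi1 = 1.0 - math.sqrt(1.0 - c)  # the stable fixed point when -3 < c < 1
    x = simple_iteration(make_g(c), 0.5, 30)
    print(f"c = {c}: |x_30 - xi_1| = {abs(x - xi1):.3e}")
```

With $c = 0.99$ we have $g'(\xi_1) = 0.9$ and the error after 30 steps is still visible in the third decimal place, whereas with $c = 0$ the iterates collapse to 0 within a handful of steps. Starting above $\xi_2$ (for instance $x_0 = 2$ with $c = 0.5$) makes the iterates grow without bound.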


The purpose of our next example is to illustrate the concept of asymptotic
rate of convergence. According to Definition 1.4, the asymptotic
rate of convergence of a sequence describes the relative closeness of
successive terms in the sequence to the limit $\xi$ as $k \to \infty$. Of course, for
small values of $k$ the sequence may behave in quite a different way, and
since in practical computation we are interested in approximating the
limit of the sequence by using just a small number of terms, the asymptotic
rate of convergence may sometimes give a misleading impression.


Example 1.6 In this example we study the convergence of the sequences
$(u_k)$ and $(v_k)$ defined by

$$u_{k+1} = g_1(u_k), \quad k = 0, 1, 2, \ldots, \quad u_0 = 1,$$
$$v_{k+1} = g_2(v_k), \quad k = 0, 1, 2, \ldots, \quad v_0 = 1,$$

where

$$g_1(x) = 0.99x \quad \text{and} \quad g_2(x) = \frac{x}{(1 + x^{1/10})^{10}}.$$

Each of the two functions has a fixed point at $\xi = 0$, and we easily find
that $g_1'(0) = 0.99$, $g_2'(0) = 1$. Hence the sequence $(u_k)$ is linearly
convergent to zero with asymptotic rate of convergence $\rho = -\log_{10} 0.99 \approx
0.004$, while Theorem 1.5 does not apply to the sequence $(v_k)$. It is quite
easy to show by induction that $v_k = (k + 1)^{-10}$, so the sequence $(v_k)$
also converges to zero, but since $\lim_{k\to\infty} (v_{k+1}/v_k) = 1$ the convergence
is sublinear. This means that, in the limit, $(u_k)$ will converge faster than
$(v_k)$. However, this is not what happens for small $k$, as Table 1.2 shows
very clearly.

Table 1.2. The sequences $(u_k)$ and $(v_k)$ in Example 1.6.

     k    u_k         v_k
     0    1.000000    1.000000
     1    0.990000    0.000977
     2    0.980100    0.000017
     3    0.970299    0.000001
     4    0.960596    0.000000
     5    0.950990    0.000000
     6    0.941480    0.000000
     7    0.932065    0.000000
     8    0.922745    0.000000
     9    0.913517    0.000000
    10    0.904382    0.000000

The sequence $(v_k)$ has converged to zero correct to 6 decimal digits
when $k = 4$, and to 10 decimal digits when $k = 10$, at which stage $u_k$
is still larger than 0.9. Although $(u_k)$ eventually converges faster than
$(v_k)$, we find that $u_k = (0.99)^k$ becomes smaller than $v_k = (k + 1)^{-10}$
when

$$k > \frac{10}{\ln(1/0.99)} \ln(k + 1).$$

This first happens when $k = 9067$, at which point $u_k$ and $v_k$ are both
roughly $10^{-40}$. In this rather extreme example the concept of asymptotic
rate of convergence is not useful, since for any practical purposes $(v_k)$
converges faster than $(u_k)$. $\diamond$
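The crossover index quoted in Example 1.6 can be checked directly. The short Python sketch below (the function name `first_crossover` is ours) searches for the smallest $k \ge 1$ with $(0.99)^k < (k+1)^{-10}$; it compares logarithms rather than the values themselves, since both quantities are of order $10^{-40}$ at the crossover.

```python
import math


def first_crossover():
    """Smallest k >= 1 with u_k = 0.99**k below v_k = (k + 1)**(-10).

    The test u_k < v_k is carried out on logarithms:
    k * ln(0.99) < -10 * ln(k + 1).
    """
    k = 1
    while k * math.log(0.99) >= -10.0 * math.log(k + 1):
        k += 1
    return k


k_star = first_crossover()
# Prints the crossover index together with u_k and v_k at that index.
print(k_star, 0.99 ** k_star, (k_star + 1) ** -10.0)
```

The search returns 9067, the value stated in the example, and both printed values are of order $10^{-40}$.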


1.3 Iterative solution of equations 


In this section we apply the idea of simple iteration to the solution
of equations. Given a real-valued continuous function $f$, we wish to
construct a sequence $(x_k)$, using iteration, which converges to a solution
of $f(x) = 0$. We begin with an example where it is easy to derive
various such sequences; in the next section we shall describe a more
general approach.


Example 1.7 Consider the problem of determining the solutions of the
equation $f(x) = 0$, where $f: x \mapsto e^x - x - 2$.

Since $f'(x) = e^x - 1$ the function $f$ is monotonic increasing for positive
$x$ and monotonic decreasing for negative values of $x$. Moreover,

$$\begin{aligned} f(1) &= e - 3 < 0, \\ f(2) &= e^2 - 4 > 0, \\ f(-1) &= e^{-1} - 1 < 0, \\ f(-2) &= e^{-2} > 0. \end{aligned} \tag{1.16}$$

Hence the equation $f(x) = 0$ has exactly one positive solution, which
lies in the interval $(1,2)$, and exactly one negative solution, which lies in
the interval $(-2,-1)$. This is illustrated in Figure 1.4, which shows the
graphs of the functions $x \mapsto e^x$ and $x \mapsto x + 2$ on the same axes. We
shall write $\xi_1$ for the positive solution and $\xi_2$ for the negative solution.


Fig. 1.4. Graphs of $y = e^x$ and $y = x + 2$.


The equation $f(x) = 0$ may be written in the equivalent form

$$x = \ln(x + 2),$$

which suggests a simple iteration defined by $g(x) = \ln(x + 2)$. We shall
show that the positive solution $\xi_1$ is a stable fixed point of $g$, while $\xi_2$ is
an unstable fixed point of $g$.

Clearly, $g'(x) = 1/(x + 2)$, so $0 < g'(\xi_1) < 1$, since $\xi_1$ is the positive
solution. Therefore, by Theorem 1.5, the sequence $(x_k)$ defined by the
iteration

$$x_{k+1} = \ln(x_k + 2), \quad k = 0, 1, 2, \ldots, \tag{1.17}$$

will converge to the positive solution, $\xi_1$, provided that the starting value
$x_0$ is sufficiently close to it.¹ As $0 < g'(\xi_1) < 1/3$, the asymptotic rate
of convergence of $(x_k)$ to $\xi_1$ is certainly greater than $\log_{10} 3$.

On the other hand, $g'(\xi_2) > 1$ since $-2 < \xi_2 < -1$, so the sequence
$(x_k)$ defined by (1.17) cannot converge to the solution $\xi_2$. It is not
difficult to prove that for $x_0 > \xi_2$ the sequence $(x_k)$ converges to $\xi_1$ while
if $x_0 < \xi_2$ the sequence will decrease monotonically until $x_k \le -2$ for
some $k$, and then the iteration breaks down as $g(x_k)$ becomes undefined.

The equation $f(x) = 0$ may also be written in the form $x = e^x - 2$,
suggesting the sequence $(x_k)$ defined by the iteration

$$x_{k+1} = e^{x_k} - 2, \quad k = 0, 1, 2, \ldots.$$

In this case $g(x) = e^x - 2$ and $g'(x) = e^x$. Hence $g'(\xi_1) > 1$, $0 < g'(\xi_2) < e^{-1}$,
showing that the sequence $(x_k)$ may converge to $\xi_2$, but cannot converge
to $\xi_1$. It is quite straightforward to show that the sequence converges to
$\xi_2$ for any $x_0 < \xi_1$, but diverges to $+\infty$ when $x_0 > \xi_1$.

As a third alternative, consider rewriting the equation $f(x) = 0$ as
$x = g(x)$ where the function $g$ is defined by $g(x) = x(e^x - x)/2$; the
fixed points of the associated iteration $x_{k+1} = g(x_k)$ are the solutions $\xi_1$
and $\xi_2$ of $f(x) = 0$, and also the point 0. For this iteration neither of the
fixed points, $\xi_1$ or $\xi_2$, is stable, and the sequence $(x_k)$ either converges
to 0 or diverges to $\pm\infty$.

Evidently the given equation may be written in many different forms,
leading to iterations with different properties. $\diamond$
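The three rearrangements in Example 1.7 can be tried out in a few lines of Python. This is a sketch under our own choices: the helper `iterate`, its stopping tolerance and the starting values are illustrative and not taken from the text. A return value of `None` signals that the iteration broke down, either because the logarithm became undefined or because the exponential overflowed.

```python
import math


def iterate(g, x0, max_steps=200, tol=1e-12):
    """Run x_{k+1} = g(x_k); return the last iterate, or None on breakdown."""
    x = x0
    for _ in range(max_steps):
        try:
            x_new = g(x)
        except (ValueError, OverflowError):
            return None  # log of a non-positive number, or exp overflow
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


g_log = lambda x: math.log(x + 2.0)              # x = ln(x + 2)
g_exp = lambda x: math.exp(x) - 2.0              # x = e^x - 2
g_third = lambda x: x * (math.exp(x) - x) / 2.0  # x = x(e^x - x)/2

print(iterate(g_log, 1.0))     # approaches the positive solution xi_1
print(iterate(g_exp, -1.0))    # approaches the negative solution xi_2
print(iterate(g_third, 0.5))   # approaches the extra fixed point 0
print(iterate(g_log, -1.9))    # x_0 < xi_2: breaks down, returns None
print(iterate(g_exp, 1.5))     # x_0 > xi_1: overflows, returns None
```

The same equation therefore yields an iteration that finds only $\xi_1$, one that finds only $\xi_2$, and one that finds neither.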


1.4 Relaxation and Newton’s method 


In the previous section we saw how various ingenious devices lead to
iterations which may or may not converge to the desired solutions of a
given equation $f(x) = 0$. We would obviously benefit from a more generally
applicable iterative method which would, except possibly in special
cases, produce a sequence $(x_k)$ that always converges to a required
solution. One way of constructing such a sequence is by relaxation.

¹ In fact, by applying the Contraction Mapping Theorem on an arbitrary bounded
closed interval $[0, M]$ where $M > \xi_1$, we conclude that the sequence $(x_k)$ defined
by the iteration (1.17) will converge to $\xi_1$ from any positive starting value $x_0$.
Definition 1.5 Suppose that $f$ is a real-valued function, defined and
continuous in a neighbourhood of a real number $\xi$. Relaxation uses the
sequence $(x_k)$ defined by

$$x_{k+1} = x_k - \lambda f(x_k), \quad k = 0, 1, 2, \ldots, \tag{1.18}$$

where $\lambda \ne 0$ is a fixed real number whose choice will be made clear below,
and $x_0$ is a given starting value near $\xi$.


If the sequence $(x_k)$ defined by (1.18) converges to $\xi$, then $\xi$ is a
solution of the equation $f(x) = 0$, as we assume that $f$ is continuous.

It is clear from (1.18) that relaxation is a simple iteration of the form
$x_{k+1} = g(x_k)$, $k = 0, 1, 2, \ldots$, with $g(x) = x - \lambda f(x)$. Suppose now,
further, that $f$ is differentiable in a neighbourhood of $\xi$. It then follows
that $g'(x) = 1 - \lambda f'(x)$ for all $x$ in this neighbourhood; hence, if $f(\xi) = 0$
and $f'(\xi) \ne 0$, the sequence $(x_k)$ defined by the iteration $x_{k+1} = g(x_k)$,
$k = 0, 1, 2, \ldots$, will converge to $\xi$ if we choose $\lambda$ to have the same sign as
$f'(\xi)$, to be not too large, and take $x_0$ sufficiently close to $\xi$. This idea
is made more precise in the next theorem.


Theorem 1.7 Suppose that $f$ is a real-valued function, defined and
continuous in a neighbourhood of a real number $\xi$, and let $f(\xi) = 0$.
Suppose further that $f'$ is defined and continuous in some neighbourhood
of $\xi$, and let $f'(\xi) \ne 0$. Then, there exist positive real numbers $\lambda$ and
$\delta$ such that the sequence $(x_k)$ defined by the relaxation iteration (1.18)
converges to $\xi$ for any $x_0$ in the interval $[\xi - \delta, \xi + \delta]$.


Proof Suppose that $f'(\xi) = \alpha$, and that $\alpha$ is positive. If $f'(\xi)$ is
negative, the proof is similar, with appropriate changes of sign. Since $f'$
is continuous in some neighbourhood of $\xi$, we can find a positive real
number $\delta$ such that $f'(x) \ge \frac{1}{2}\alpha$ in the interval $[\xi - \delta, \xi + \delta]$. Let $M$ be an
upper bound for $f'(x)$ in this interval. Hence $M \ge \frac{1}{2}\alpha$. In order to fix
the value of the real number $\lambda$, we begin by noting that, for any $\lambda > 0$,

$$1 - \lambda M \le 1 - \lambda f'(x) \le 1 - \tfrac{1}{2}\lambda\alpha, \qquad x \in [\xi - \delta, \xi + \delta].$$

We now choose $\lambda$ so that these extreme values are equal and opposite,
i.e., $1 - \lambda M = -\vartheta$ and $1 - \frac{1}{2}\lambda\alpha = \vartheta$ for a suitable nonnegative real
number $\vartheta$. There is a unique value of $\vartheta$ for which this holds; it is given
by the formula

$$\vartheta = \frac{2M - \alpha}{2M + \alpha},$$

corresponding to

$$\lambda = \frac{4}{2M + \alpha}.$$

On defining $g(x) = x - \lambda f(x)$, we then deduce that

$$|g'(x)| \le \vartheta < 1, \qquad x \in [\xi - \delta, \xi + \delta]. \tag{1.19}$$

Thus we can apply Theorem 1.5 to conclude that the sequence $(x_k)$
defined by the relaxation iteration (1.18) converges to $\xi$, provided that
$x_0$ is in the interval $[\xi - \delta, \xi + \delta]$. The asymptotic rate of convergence of
the relaxation iteration (1.18) to $\xi$ is at least $-\log_{10} \vartheta$.
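The choice of $\lambda$ made in the proof can be tried on the function $f(x) = e^x - x - 2$ of Example 1.7. The following Python sketch is only an illustration under stated assumptions: the approximate value `xi` of the positive solution and the half-width `delta = 0.15` are our own choices, used solely to set up the bounds $\alpha$ and $M$, and the function name `relaxation` is hypothetical.

```python
import math


def f(x):
    return math.exp(x) - x - 2.0


def relaxation(f, lam, x0, steps):
    """Apply x_{k+1} = x_k - lam * f(x_k), iteration (1.18), `steps` times."""
    x = x0
    for _ in range(steps):
        x = x - lam * f(x)
    return x


# Assumed data for the illustration (not from the text):
xi = 1.146193                    # approximate positive solution of f(x) = 0
delta = 0.15                     # half-width of [xi - delta, xi + delta]
alpha = math.exp(xi) - 1.0       # f'(xi), since f'(x) = e^x - 1
M = math.exp(xi + delta) - 1.0   # upper bound for f' on the interval (f' is increasing)

lam = 4.0 / (2.0 * M + alpha)                  # lambda chosen as in the proof
theta = (2.0 * M - alpha) / (2.0 * M + alpha)  # bound (1.19) on |g'(x)|

x = relaxation(f, lam, xi - delta, 40)
print(lam, theta, x, f(x))
```

On this interval $f'(x) \ge \frac{1}{2}\alpha$ holds, so the hypotheses used in the proof are satisfied, and the computed `theta` lies strictly between 0 and 1. In practice $\xi$ is unknown, so $\alpha$ and $M$ would have to be replaced by bounds on $f'$ over an interval known to contain the solution.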


We can now extend the idea of relaxation by allowing $\lambda$ to be a continuous
function of $x$ in a neighbourhood of $\xi$ rather than just a constant.
This suggests an iteration

$$x_{k+1} = x_k - \lambda(x_k) f(x_k), \quad k = 0, 1, 2, \ldots,$$

corresponding to a simple iteration with $g(x) = x - \lambda(x) f(x)$. If the
sequence $(x_k)$ converges, the limit $\xi$ will be a solution of $f(x) = 0$,
except possibly when $\lambda(\xi) = 0$. Moreover, as we have seen, the ultimate
rate of convergence is determined by $g'(\xi)$. Since $f(\xi) = 0$, it follows
that $g'(\xi) = 1 - \lambda(\xi) f'(\xi)$, and (1.19) suggests using a function $\lambda$ which
makes $1 - \lambda(\xi) f'(\xi)$ small. The obvious choice is $\lambda(x) = 1/f'(x)$, and
leads us to Newton’s method.¹


Definition 1.6 Newton’s method for the solution of $f(x) = 0$ is defined
by

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \quad k = 0, 1, 2, \ldots, \tag{1.20}$$

with prescribed starting value $x_0$. We implicitly assume in the defining
formula (1.20) that $f'(x_k) \ne 0$ for all $k \ge 0$.


1 Isaac Newton was born on 4 January 1643 in Woolsthorpe, Lincolnshire, England 
and died on 31 March 1727 in London, England. According to the calendar used 
in England at the time, Newton was born on Christmas day 1642, and died on 
21 March 1727: the Gregorian calendar was not adopted in England until 1752. 
Newton made revolutionary advances in mathematics, physics, astronomy and 
optics; his contributions to the foundations of calculus were marred by priority 
disputes with Leibniz. Newton was appointed to the Lucasian chair at Cambridge 
at the age of 27. In 1705, two years after becoming president of the Royal So- 
ciety (a position to which he was re-elected each year until his death), Newton 
was knighted by Queen Anne; he was the first scientist to be honoured in this 
way. Newton’s Philosophiae naturalis principia mathematica is one of the most 
important scientific books ever written. 
Newton’s method is a simple iteration with $g(x) = x - f(x)/f'(x)$.
Its geometric interpretation is illustrated in Figure 1.5: the tangent to
the curve $y = f(x)$ at the point $(x_k, f(x_k))$ is the line with the equation
$y - f(x_k) = f'(x_k)(x - x_k)$; it meets the $x$-axis at the point $(x_{k+1}, 0)$.


Fig. 1.5. Newton’s method.


We could apply Theorem 1.5 to prove the convergence of this iteration, 
but since generally it converges much faster than ordinary relaxation it 
is better to apply a special form of proof. First, however, we give a 
formal definition of quadratic convergence. 


Definition 1.7 Suppose that $\xi = \lim_{k\to\infty} x_k$. We say that the sequence
$(x_k)$ converges to $\xi$ with at least order $q > 1$, if there exist a sequence
$(\varepsilon_k)$ of positive real numbers converging to 0, and $\mu > 0$, such that

$$|x_k - \xi| \le \varepsilon_k, \quad k = 0, 1, 2, \ldots, \quad \text{and} \quad \lim_{k\to\infty} \frac{\varepsilon_{k+1}}{\varepsilon_k^q} = \mu. \tag{1.21}$$

If (1.21) holds with $\varepsilon_k = |x_k - \xi|$ for $k = 0, 1, 2, \ldots$, then the sequence
$(x_k)$ is said to converge to $\xi$ with order $q$. In particular, if $q = 2$, then
we say that the sequence $(x_k)$ converges to $\xi$ quadratically.
We note that unlike the definition of linear convergence where $\mu$ was
required to belong to the interval $(0, 1)$, all we demand here is that $\mu > 0$.
The reason is simple: when $q > 1$, (1.21) implies suitably rapid decay of
the sequence $(\varepsilon_k)$ irrespective of the size of $\mu$.


Example 1.8 Let $c > 1$ and $q > 1$. The sequence $(x_k)$ defined by
$x_k = c^{-q^k}$, $k = 0, 1, 2, \ldots$, converges to 0 with order $q$.


Theorem 1.8 (Convergence of Newton’s method) Suppose that
$f$ is a continuous real-valued function with continuous second derivative
$f''$, defined on the closed interval $I_\delta = [\xi - \delta, \xi + \delta]$, $\delta > 0$, such that
$f(\xi) = 0$ and $f''(\xi) \ne 0$. Suppose further that there exists a positive
constant $A$ such that

$$\frac{|f''(x)|}{|f'(y)|} \le A \qquad \forall\, x, y \in I_\delta.$$

If $|\xi - x_0| \le h$, where $h$ is the smaller of $\delta$ and $1/A$, then the sequence
$(x_k)$ defined by Newton’s method (1.20) converges quadratically to $\xi$.


Proof Suppose that $|\xi - x_k| \le h = \min\{\delta, 1/A\}$, so that $x_k \in I_\delta$. Then,
by Taylor’s Theorem (Theorem A.4), expanding about the point $x_k \in I_\delta$,

$$0 = f(\xi) = f(x_k) + (\xi - x_k) f'(x_k) + \frac{(\xi - x_k)^2}{2} f''(\eta_k), \tag{1.22}$$

for some $\eta_k$ between $\xi$ and $x_k$, and therefore in the interval $I_\delta$. Recalling
(1.20), this shows that

$$\xi - x_{k+1} = -\frac{(\xi - x_k)^2 f''(\eta_k)}{2 f'(x_k)}. \tag{1.23}$$

Since $|\xi - x_k| \le 1/A$, we have $|\xi - x_{k+1}| \le \frac{1}{2} |\xi - x_k|$. As we are given that
$|\xi - x_0| \le h$ it follows by induction that $|\xi - x_k| \le 2^{-k} h$ for all $k \ge 0$;
hence $(x_k)$ converges to $\xi$ as $k \to \infty$.
Now, $\eta_k$ lies between $\xi$ and $x_k$, and therefore $(\eta_k)$ also converges to $\xi$
as $k \to \infty$. Since $f'$ and $f''$ are continuous on $I_\delta$, it follows from (1.23)
that

$$\lim_{k\to\infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|^2} = \left| \frac{f''(\xi)}{2 f'(\xi)} \right|, \tag{1.24}$$

which, according to Definition 1.7, implies quadratic convergence of the
sequence $(x_k)$ to $\xi$ with $\mu = |f''(\xi)/2f'(\xi)|$, $\mu \in (0, A/2]$.
The conditions of the theorem implicitly require that $f'(\xi) \ne 0$, for
otherwise the quantity $f''(x)/f'(y)$ could not be bounded in a
neighbourhood of $\xi$. (See Exercises 6 and 7 for what happens when $f'(\xi) = 0$.)

One can show that if $f''(\xi) = 0$ and we assume that $f(x)$ has a
continuous third derivative, and require certain quantities to be bounded,
then the convergence is cubic (i.e., convergence with order $q = 3$).

It is possible to demonstrate that Newton’s method converges over a 
wider interval, if we assume something about the signs of the derivatives. 


Theorem 1.9 Suppose that the function $f$ satisfies the conditions of
Theorem 1.8 and also that there exists a real number $X$, $X > \xi$, such
that in the interval $J = [\xi, X]$ both $f'$ and $f''$ are positive. Then, the
sequence $(x_k)$ defined by Newton’s method (1.20) converges quadratically
to $\xi$ from any starting value $x_0$ in $J$.


Proof It follows from (1.23) that if $x_k \in J$, then $x_{k+1} \ge \xi$. Moreover,
since $f'(x) > 0$ on $J$, $f$ is monotonic increasing on $J$. As $f(\xi) = 0$, it
then follows that $f(x) \ge 0$ for $\xi \le x \le X$. Hence, $\xi \le x_{k+1} \le x_k$,
$k \ge 0$. Since the sequence $(x_k)$ is bounded and monotonic decreasing,
it is convergent; let $\eta = \lim_{k\to\infty} x_k$. Clearly, $\eta \in J$. Further, passing to
the limit $k \to \infty$ in (1.20) we have that $f(\eta) = 0$. However, $\xi$ is the only
solution of $f(x) = 0$ in $J$, so $\eta = \xi$, and the sequence converges to $\xi$.

Having shown that the sequence $(x_k)$ converges, the fact that it
converges quadratically follows as in the proof of Theorem 1.8.


We remark that the same result holds for other possible signs of $f'$
and $f''$ in a suitable interval $J$. (See Exercise 8.) The interval $J$ does
not have to be bounded; considering, for instance, $f(x) = e^x - x - 2$
from Example 1.7, it is clear that $f'(x)$ and $f''(x)$ are both positive in
the unbounded interval $(0, \infty)$, and the Newton iteration converges to
the positive solution of the equation $f(x) = 0$ from any positive starting
value $x_0$.

Note that the definition of quadratic convergence only refers to the
behaviour of the sequence for sufficiently large $k$. In the same example we
find that the convergence of the Newton iteration from a large positive
value of $x_0$ is initially very slow. (See Exercise 3.) The possibility of
this early behaviour is often emphasised by saying that the convergence
of Newton’s method is ultimately quadratic.
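The contrast between the early and the ultimate behaviour of Newton’s method can be seen on $f(x) = e^x - x - 2$. In the Python sketch below the function name `newton`, the stopping tolerance and the starting values 1 and 50 are our own illustrative choices.

```python
import math


def newton(f, df, x0, tol=1e-12, max_steps=200):
    """Newton's method (1.20); returns the list of iterates x_0, x_1, ...

    Stops when two successive iterates differ by less than `tol`.
    Assumes df(x_k) != 0 at every iterate, as in Definition 1.6.
    """
    xs = [x0]
    for _ in range(max_steps):
        x = xs[-1]
        x_new = x - f(x) / df(x)
        xs.append(x_new)
        if abs(x_new - x) < tol:
            break
    return xs


f = lambda x: math.exp(x) - x - 2.0
df = lambda x: math.exp(x) - 1.0

near = newton(f, df, 1.0)
far = newton(f, df, 50.0)
print(len(near) - 1, near[-1])          # number of steps and limit from x_0 = 1
print(len(far) - 1, far[-1])            # number of steps and limit from x_0 = 50
print([round(x, 3) for x in far[:5]])   # early iterates from x_0 = 50
```

From $x_0 = 50$ the quotient $f(x_k)/f'(x_k)$ is very nearly 1 while $x_k$ is large, so the iterates at first decrease by about one unit per step; only once they are close to the solution does the number of correct digits roughly double at each step.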
1.5 The secant method


So far we have considered iterations which can be written in the form
$x_{k+1} = g(x_k)$, $k \ge 0$, so that the new value is expressed in terms of the
old one. It is also possible to define an iteration of the form $x_{k+1} =
g(x_k, x_{k-1})$, $k \ge 1$, where the new value is expressed in terms of two
previous values. In particular, we shall consider two applications of
this idea, leading to the secant method and the method of bisection,
respectively.


Remark 1.3 We note in passing that one can consider more general
iterative methods of the form

$$x_{k+1} = g(x_k, x_{k-1}, \ldots, x_{k-\ell}), \quad k = \ell, \ell + 1, \ldots,$$

with $\ell \ge 1$ fixed; here, we shall confine ourselves to the simplest case
when $\ell = 1$ as this is already sufficiently illuminating.


Using Newton’s method to solve a nonlinear equation $f(x) = 0$
requires explicit knowledge of the first derivative $f'$ of the function $f$.
Unfortunately, in many practical situations $f'$ is not explicitly available
or it can only be obtained at high computational cost. In such cases,
the value $f'(x_k)$ in (1.20) can be approximated by a difference quotient;
that is,

$$f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.$$

Replacing $f'(x_k)$ in (1.20) by this difference quotient leads us to the
following definition.


Definition 1.8 The secant method is defined by

$$x_{k+1} = x_k - f(x_k) \left( \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} \right), \quad k = 1, 2, 3, \ldots, \tag{1.25}$$

where $x_0$ and $x_1$ are given starting values. It is implicitly assumed here
that $f(x_k) - f(x_{k-1}) \ne 0$ for all $k \ge 1$.


The method is illustrated in Figure 1.6. The new iterate $x_{k+1}$ is
obtained from $x_{k-1}$ and $x_k$ by drawing the chord joining the points
$P(x_{k-1}, f(x_{k-1}))$ and $Q(x_k, f(x_k))$, and using as $x_{k+1}$ the point at which
this chord intersects the $x$-axis. If $x_{k-1}$ and $x_k$ are close together and $f$
is differentiable, $x_{k+1}$ is approximately the same as the value supplied
by Newton’s method, which uses the tangent at the point $Q$.

Fig. 1.6. Secant method.


Theorem 1.10 Suppose that $f$ is a real-valued function, defined and
continuously differentiable on an interval $I = [\xi - h, \xi + h]$, $h > 0$,
with centre point $\xi$. Suppose further that $f(\xi) = 0$, $f'(\xi) \ne 0$. Then,
the sequence $(x_k)$ defined by the secant method (1.25) converges at least
linearly to $\xi$ provided that $x_0$ and $x_1$ are sufficiently close to $\xi$.


Proof Since $f'(\xi) \ne 0$, we may suppose that $f'(\xi) = \alpha > 0$; only minor
changes are needed in the proof when $f'(\xi)$ is negative. Since $f'$ is
continuous on $I$, corresponding to any $\varepsilon > 0$ we can choose an interval
$I_\delta = [\xi - \delta, \xi + \delta]$, with $0 < \delta \le h$, such that

$$|f'(x) - \alpha| < \varepsilon, \qquad x \in I_\delta. \tag{1.26}$$

Choosing $\varepsilon = \frac{1}{4}\alpha$ we see that

$$0 < \tfrac{3}{4}\alpha < f'(x) < \tfrac{5}{4}\alpha, \qquad x \in I_\delta. \tag{1.27}$$

From (1.25) and using the Mean Value Theorem (Theorem A.3) together
with the fact that $f(\xi) = 0$, we obtain

$$x_{k+1} - \xi = x_k - \xi - \frac{(x_k - \xi) f'(\vartheta_k)}{f'(\varphi_k)}, \tag{1.28}$$

where $\vartheta_k$ is between $x_k$ and $\xi$, and $\varphi_k$ lies between $x_k$ and $x_{k-1}$. Hence,
if $x_{k-1} \in I_\delta$ and $x_k \in I_\delta$, then also $\vartheta_k \in I_\delta$ and $\varphi_k \in I_\delta$. Therefore,

$$|x_{k+1} - \xi| \le |x_k - \xi| \left| 1 - \frac{5\alpha/4}{3\alpha/4} \right| = \frac{2}{3} |x_k - \xi|. \tag{1.29}$$

Thus, $x_{k+1} \in I_\delta$ and the sequence $(x_k)$ converges to $\xi$ at least linearly,
with rate at least $\log_{10}(3/2)$, provided that $x_0 \in I_\delta$ and $x_1 \in I_\delta$. $\square$

In fact, it can be shown that

$$\lim_{k\to\infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|^q} = \mu, \tag{1.30}$$

where $\mu$ is a positive constant and $q = \frac{1}{2}(1 + \sqrt{5}) \approx 1.6$, so that the
convergence of the sequence $(x_k)$ to $\xi$ is faster than linear, but not as
fast as quadratic. (See Exercise 10.)

This is illustrated in Table 1.3, which compares two iterative methods
for the solution of $f(x) = 0$ with $f: x \mapsto e^x - x - 2$; the first is the secant
method, starting from $x_0 = 1$, $x_1 = 3$, while the second is Newton’s
method starting from $x_0 = 1$.

Table 1.3. Comparison of the secant method and Newton’s method for
the solution of $e^x - x - 2 = 0$.

    k    Secant method    Newton’s method
    0    1.000000         1.000000
    1    3.000000         1.163953
    2    1.036665         1.146421
    3    1.064489         1.146193
    4    1.153299         1.146193
    5    1.145745
    6    1.146191
    7    1.146193

This experiment shows the faster convergence of Newton’s method,
but it must be remembered that each iteration of Newton’s method
requires the calculation of both $f(x_k)$ and $f'(x_k)$, while each iteration
of the secant method requires the calculation of $f(x_k)$ only (as $f(x_{k-1})$
has already been computed). In our examples the computations are
quite trivial, but in a practical situation the calculation of each value of
$f(x_k)$ and $f'(x_k)$ may demand a substantial amount of work, and then
each iteration of Newton’s method is likely to involve at least twice as
much work as one iteration of the secant method.
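The secant column of Table 1.3 can be regenerated with a short program. The Python sketch below is illustrative: the function name `secant`, the stopping tolerance and the guard against a zero denominator are our own additions. It carries $f(x_{k-1})$ over from the previous step, so that each iteration costs a single new evaluation of $f$.

```python
import math


def secant(f, x0, x1, tol=1e-12, max_steps=100):
    """Secant method (1.25); returns the list of iterates x_0, x_1, x_2, ...

    Only one new evaluation of f is made per step, because f(x_{k-1})
    is reused from the previous step.
    """
    xs = [x0, x1]
    f_prev, f_curr = f(x0), f(x1)
    for _ in range(max_steps):
        if f_curr == f_prev:  # formula (1.25) is undefined here; stop
            break
        x_new = xs[-1] - f_curr * (xs[-1] - xs[-2]) / (f_curr - f_prev)
        xs.append(x_new)
        if abs(x_new - xs[-2]) < tol:  # xs[-2] is now the previous iterate
            break
        f_prev, f_curr = f_curr, f(x_new)
    return xs


f = lambda x: math.exp(x) - x - 2.0

xs = secant(f, 1.0, 3.0)
for k, x in enumerate(xs[:8]):
    print(k, f"{x:.6f}")
```

The printed iterates can be compared with the secant column of Table 1.3, starting from $x_0 = 1$ and $x_1 = 3$.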


1.6 The bisection method 


Suppose that $f$ is a real-valued function defined and continuous on a
bounded closed interval $[a, b]$ of the real line and such that $f(\xi) = 0$
for some $\xi \in [a, b]$. A very simple iterative method for the solution of
the nonlinear equation $f(x) = 0$ can be constructed by beginning with
an interval $[a_0, b_0]$ which is known to contain the required solution $\xi$
(e.g., one may choose $[a_0, b_0]$ as the interval $[a, b]$ itself, with $a_0 = a$ and
$b_0 = b$), and successively halving its size.

More precisely, we proceed as follows. Let $k \ge 0$, and suppose that it is
known that $f(a_k)$ and $f(b_k)$ have opposite signs; we then conclude from
Theorem 1.1 that the interval $(a_k, b_k)$ contains a solution of $f(x) = 0$.
Consider the midpoint $c_k$ of the interval $(a_k, b_k)$ defined by

$$c_k = \tfrac{1}{2}(a_k + b_k), \tag{1.31}$$

and evaluate $f(c_k)$. If $f(c_k)$ is zero, then we have located a solution $\xi$
of $f(x) = 0$, and the iteration stops. Else, we define the new interval
$(a_{k+1}, b_{k+1})$ by

$$(a_{k+1}, b_{k+1}) = \begin{cases} (a_k, c_k) & \text{if } f(c_k) f(b_k) > 0, \\ (c_k, b_k) & \text{if } f(c_k) f(b_k) < 0, \end{cases} \tag{1.32}$$

and repeat this procedure.

This may at first seem to be a very crude method, but it has some
important advantages. The analysis of convergence is trivial; the size of
the interval containing $\xi$ is halved at each iteration, so the sequence $(c_k)$
defined by the bisection method converges linearly, with rate $\rho = \log_{10} 2$.
Even Newton’s method may often converge more slowly than this in the
early stages, when the starting value is far from the desired solution.
Moreover, the convergence analysis assumes only that the function $f$ is
continuous, and requires no bounds on the derivatives, nor even their
existence.¹ Once we can find an interval $[a_0, b_0]$ such that $f(a_0)$ and
$f(b_0)$ have opposite signs, we can guarantee convergence to a solution,
and that after $k$ iterations the solution $\xi$ will lie in an interval of length
$(b_0 - a_0)/2^k$. The bisection method is therefore very robust, though
Newton’s method will always win once the current iterate is sufficiently
close to $\xi$.

¹ Consider, for example, solving the equation $f(x) = 0$, where the function $f$ is
defined by (1.2). Even though $f$ is not differentiable at the point $x = 1.05$, the
bisection method is applicable. It has to be noted, however, that for functions of
this kind it is not always easy to find an interval $[a_0, b_0]$ in which $f$ changes sign.

Fig. 1.7. Bisection; from the initial interval $[a_0, b_0]$ the next interval is $[a_0, c_0]$,
but starting from $[a_0, b_0']$ the next interval is $[c_0', b_0']$.

If the initial interval $[a_0, b_0]$ contains more than one solution, the limit
of the bisection method will depend on the positions of these solutions.
Figure 1.7 illustrates a possible situation, where $[a_0, b_0]$ contains three
solutions. Since $f(c_0)$ has the same sign as $f(b_0)$ the second interval is
$[a_0, c_0]$, and the sequence $(c_k)$ of midpoints defined by (1.31) converges
to the solution $\xi_1$. If however the initial interval is $[a_0, b_0']$ the sequence
of midpoints converges to the solution $\xi_3$.
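The bisection procedure (1.31)–(1.32) translates almost line by line into code. The following Python sketch is an illustration only: the function name `bisection`, the choice of 20 steps and the test function from Example 1.7 are our own.

```python
import math


def bisection(f, a, b, steps):
    """Bisection following (1.31)-(1.32); returns the final interval (a_k, b_k).

    Assumes f is continuous and that f(a) and f(b) have opposite signs.
    """
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(steps):
        c = 0.5 * (a + b)       # midpoint (1.31)
        fc = f(c)
        if fc == 0.0:
            return c, c         # c is a solution; the iteration stops
        if fc * fb > 0:
            b, fb = c, fc       # new interval is (a_k, c_k)
        else:
            a = c               # new interval is (c_k, b_k)
    return a, b


f = lambda x: math.exp(x) - x - 2.0

a, b = bisection(f, 1.0, 2.0, 20)
print(a, b, b - a)   # the interval length is (2 - 1) / 2**20
```

Only the sign of $f$ at the midpoint is used, which is why no information about derivatives is needed.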


1.7 Global behaviour 


We have already seen how an iteration will often converge to a limit 
if the starting value is sufficiently close to that limit. The behaviour 
of the iteration, when started from an arbitrary starting value, can be 
very complicated. In this section we shall consider two examples. No 
theorems will be stated: our aim is simply to illustrate various kinds of 
behaviour. 
First consider the simple iteration defined by

$$x_{k+1} = g(x_k), \quad k = 0, 1, 2, \ldots, \quad \text{where } g(x) = a x (1 - x), \tag{1.33}$$

which is often known as the logistic equation. We require the constant
$a$ to lie in the range $0 < a \le 4$, for then if the starting value $x_0$ is in
the interval $[0, 1]$, then all members of the sequence $(x_k)$ also lie in $[0, 1]$.
The function $g$ has two fixed points: $x = 0$ and $x = 1 - 1/a$. The fixed
point at 0 is stable if $0 < a < 1$, and the fixed point at $1 - 1/a$ is stable
if $1 < a < 3$. The behaviour of the iteration for these values of $a$ is what
might be expected from this information, but for larger values of the
parameter $a$ the behaviour of the sequence $(x_k)$ becomes increasingly
complicated.

For example, when $a = 3.4$ there is no stable fixed point, and from
any starting point the sequence eventually oscillates between two values,
which are 0.45 and 0.84 to two decimal digits. These are the two stable
fixed points of the double iteration

$$x_{k+1} = g^*(x_k), \qquad g^*(x) = g(g(x)) = a^2 x (1 - x)[1 - a x (1 - x)]. \tag{1.34}$$

When $3 < a < 1 + \sqrt{6}$, the fixed points of $g^*$ are the two fixed points of
$g$, that is 0 and $1 - 1/a$, and also

$$\frac{1}{2} \left( 1 + \frac{1}{a} \pm \frac{1}{a} \left[ a^2 - 2a - 3 \right]^{1/2} \right). \tag{1.35}$$

This behaviour is known as a stable two-cycle (see Exercise 12).

When $a > 1 + \sqrt{6}$ all the fixed points of $g^*$ are unstable. For example,
when $a = 3.5$ all sequences $(x_k)$ defined by (1.33) tend to a stable 4-cycle,
taking successive values 0.50, 0.87, 0.38 and 0.83.
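The two-cycle and the 4-cycle described above are easily observed numerically. In the Python sketch below the function names `logistic_orbit` and `two_cycle`, the starting value 0.3 and the number of discarded iterations are our own illustrative choices.

```python
import math


def logistic_orbit(a, x0, burn_in, keep):
    """Iterate x_{k+1} = a x_k (1 - x_k); skip `burn_in` terms, return the next `keep`."""
    x = x0
    for _ in range(burn_in):
        x = a * x * (1.0 - x)
    orbit = []
    for _ in range(keep):
        x = a * x * (1.0 - x)
        orbit.append(x)
    return orbit


def two_cycle(a):
    """The two additional fixed points (1.35) of the double iteration, for a > 3."""
    root = math.sqrt(a * a - 2.0 * a - 3.0)
    return 0.5 * (1.0 + 1.0 / a - root / a), 0.5 * (1.0 + 1.0 / a + root / a)


# a = 3.4: the long-run iterates alternate between the two values given by (1.35).
print(sorted(logistic_orbit(3.4, 0.3, 2000, 2)), two_cycle(3.4))

# a = 3.5: the long-run iterates cycle through four values.
print([round(x, 2) for x in logistic_orbit(3.5, 0.3, 2000, 4)])
```

For $a = 3.4$ the two limiting values agree with formula (1.35), and for $a = 3.5$ the four values, rounded to two decimal digits, are those quoted in the text (possibly starting at a different point of the cycle).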

For larger values of the parameter $a$ the sequences become chaotic.
For example, when $a = 3.99$ there are no stable fixed points or limit-cycles,
and the members of any sequence appear random. In fact it can
be shown that for such values of $a$ the members of the sequence are
dense in a subinterval of $[0,1]$: there exist real numbers $\alpha$ and $\beta$, $\alpha < \beta$,
such that any subinterval of $(\alpha, \beta)$, however small, contains an infinite
subsequence of $(x_k)$. For the value $a = 3.99$ the maximal interval $(\alpha, \beta)$
is $(0.00995, 0.99750)$ to five decimal digits. Starting from $x_0 = 0.75$ we
find that the interval $(0.70, 0.71)$, for example, contains the subsequence

$$x_{16}, x_{164}, x_{454}, x_{801}, x_{812}, \ldots. \tag{1.36}$$

The sequence does not show any apparent regular behaviour. The
calculation is extremely sensitive: if we replace $x_0$ by
$x_0 + \delta x_0$, and write $x_k + \delta x_k$ for the resulting
perturbed value of $x_k$, it is easy to see that

$$\delta x_{k+1} = a(1 - 2x_k)\,\delta x_k,$$

provided that the changes $\delta x_k$ are so small that
$a(\delta x_k)^2$ can be ignored. With $x_0 = 0.75$ as above we find from
the same calculation that $\delta x_{812}/\delta x_0$ is about $10^{231}$,
so that to determine $x_{812}$ with reasonable accuracy it is necessary to
carry through the whole calculation using 250 decimal digits.

Fig. 1.8. Global behaviour of Newton’s method.
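The size of this amplification factor can be estimated with a short computation. The sketch below is ours and rests on stated assumptions: it uses Python's standard `decimal` module with 300 working digits (an arbitrary margin over the 250 digits mentioned above), and it accumulates $\log_{10}|a(1 - 2x_k)|$ for $k = 0, 1, \ldots, 811$, which is the logarithm of the first-order growth factor of $\delta x_{812}/\delta x_0$. The function name and its parameters are hypothetical.

```python
import math
from decimal import Decimal, localcontext


def log10_growth(a="3.99", x0="0.75", steps=812, digits=300):
    """Return log10 |dx_steps / dx_0| for the logistic iteration, to first order.

    The linearised rule dx_{k+1} = a (1 - 2 x_k) dx_k means that the growth
    factor is the product of |a (1 - 2 x_k)| over k = 0, ..., steps - 1.
    """
    with localcontext() as ctx:
        ctx.prec = digits                 # working precision in decimal digits
        a_dec, x = Decimal(a), Decimal(x0)
        total = 0.0
        for _ in range(steps):
            factor = a_dec * (1 - 2 * x)
            # Only the magnitude is needed, so double precision suffices here.
            total += math.log10(abs(float(factor)))
            x = a_dec * x * (1 - x)
    return total


if __name__ == "__main__":
    print("log10 of the growth factor over 812 steps:", log10_growth())
```

Repeating the run with a smaller value of `digits` is a simple way of seeing how quickly the computed sequence loses all accuracy.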

Our second example, of more practical importance, is of Newton’s
method applied to a function $f$ with several zeros. The example is

$$f(x) = x(x^2 - 1)(x - 3)\exp\left(-\tfrac{1}{2}(x - 1)^2\right); \tag{1.37}$$


the graph of the function is shown in Figure 1.8. The function has zeros
at $-1$, 0, 1 and 3. The sequence generated by the Newton iteration will
converge to one of these solutions if the starting value is fairly close to it.
Moreover, the geometric interpretation of the iteration shows that if the
starting point is sufficiently large in absolute value the iteration diverges
rapidly to $\infty$; the iteration behaves as if the function had a zero at
infinity, and the sequence can be loosely described as ‘converging to $\infty$’.
With this interpretation some numerical experimentation soon shows
that from any starting value Newton’s method eventually converges to a
solution, which might be $\pm\infty$. However, it is certainly not true that the
sequence converges to the solution closest to the starting point; indeed,
if this were true, no sequence could converge to $\infty$. It is easy to see why
the behaviour is much more complicated than this.

The Newton iteration converges to the solution at 0 from any point in
the interval $(-0.327, 0.445)$. As we see from Figure 1.8, the iteration will
converge exactly to 0 in one iteration if we start from the $x$-coordinate
of any of the points $a_1$, $a_2$ and $a_3$; at each of these three points the
tangent to the curve passes through the origin. Since $f$ is continuous,
this means that there is an open interval surrounding each of these points
from which the Newton iteration will converge to 0. The maximal such
intervals are $(-1.555, -1.487)$, $(1.735, 1.817)$ and $(3.514, 3.529)$ to three
decimal digits. In the same way, there are several points at which the
tangent to the curve passes through the point $(A_1, 0)$, where $A_1$ is the
$x$-coordinate of the point $a_1$. Starting from one of these points, the
Newton iteration will evidently converge exactly to the solution at 0 in
two steps; surrounding each of these points there is an open interval
from which the iteration will converge to 0.

Now suppose we define the sets $S_m$, $m = -1, 0, 1, 3, \infty, -\infty$, where
$S_m$ consists of those points from which the Newton iteration converges
to the zero at $m$. Then, an extension of the above argument shows
that each of the sets $S_m$ is the union of an infinite number of disjoint
open intervals. The remarkable property of these sets is that, if $\xi$ is a
boundary point of one of the sets $S_m$, then it is also a boundary point of
all the other sets. This means that any neighbourhood of such a
point $\xi$, however small, contains an infinite number of members of each
of the sets $S_m$. For example, we have seen that the iteration starting
from any point in the interval $(-0.327, 0.445)$ converges to 0. We find
that the end of this interval lies between 0.4457855 and 0.4457860; Table
1.4 shows the limits of various Newton iterations starting from points
near this boundary. Each of these points is, of course, itself surrounded
by an open interval which gives the same limit.
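Experiments of this kind can be carried out with a few lines of code. The Python sketch below is ours, not the authors', and makes several labelled assumptions. It takes $f(x) = p(x)\exp(-\tfrac12(x-1)^2)$ with $p(x) = x(x^2-1)(x-3)$, so that $f'(x) = [p'(x) - (x-1)p(x)]\exp(-\tfrac12(x-1)^2)$ and the exponential cancels in the Newton correction $f(x)/f'(x)$. An iterate with $|x| > 10$ is reported as converging to $\pm\infty$; this cut-off is a heuristic of ours, chosen because beyond it the Newton correction always points away from the origin. The function names, tolerances and iteration limit are hypothetical.

```python
import math

ROOTS = (-1.0, 0.0, 1.0, 3.0)


def p(x):
    """Polynomial factor of f: x (x^2 - 1)(x - 3)."""
    return x * (x * x - 1.0) * (x - 3.0)


def dp(x):
    """Derivative of p: 4x^3 - 9x^2 - 2x + 3."""
    return 4.0 * x ** 3 - 9.0 * x ** 2 - 2.0 * x + 3.0


def newton_limit(x0, max_iter=500, escape=10.0, tol=1e-12):
    """Classify the limit of Newton's method started at x0.

    Returns one of the zeros -1, 0, 1, 3, or +/- infinity, or None if the
    iteration is undecided after max_iter steps or meets a zero derivative.
    """
    x = x0
    for _ in range(max_iter):
        if abs(x) > escape:                    # heuristic: treat as divergence
            return math.copysign(math.inf, x)
        denom = dp(x) - (x - 1.0) * p(x)       # f'(x) with the exponential removed
        if denom == 0.0:
            return None
        step = p(x) / denom                    # equals f(x) / f'(x)
        x -= step
        if abs(step) < tol:
            for r in ROOTS:
                if abs(x - r) < 1e-8:
                    return r
            return None
    return None


if __name__ == "__main__":
    for start in (-50.0, -1.1, 0.1, 0.9, 3.1, 50.0):
        print(start, "->", newton_limit(start))
```

Scanning a fine grid of starting values with `newton_limit` gives a picture of the sets $S_m$; close to a boundary point the outcome is naturally sensitive to rounding error.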


Table 1.4. Limit of Newton’s method near a boundary point.

    $x_0$         Limit
    0.4457840     0
    0.4457845     0
    0.4457850     0
    0.4457855     0
    0.4457860     1
    0.4457865     $-\infty$
    0.4457870     $-1$
    0.4457875     $-1$
    0.4457880     $-\infty$
    0.4457885     $-\infty$
    0.4457890     $+\infty$
    0.4457895     3
    0.4457900     1


1.8 Notes


Theorem 1.2 is a special case of Brouwer’s Fixed Point Theorem. Luitzen
Egbertus Jan Brouwer (1881-1966) was professor of set theory, function
theory and axiomatics at the University of Amsterdam, and made major
contributions to topology. Brouwer was a mathematical genius with
strong mystical and philosophical leanings. For an historical overview of
Brouwer’s life and work we refer to the recent book of Dirk Van Dalen,
Mystic, Geometer, and Intuitionist. The Life of L.E.J. Brouwer: the
Dawning Revolution, Clarendon Press, Oxford, 1999.

The Contraction Mapping Theorem, as stated here, is a simplified
version of Banach’s fixed point theorem. Stefan Banach¹ founded modern
functional analysis and made outstanding contributions to the theory
of topological vector spaces, measure theory, integration, the theory of
sets, and orthogonal series. For an inspiring account of Banach’s life
and times, see R. Kaluza, Through the Eyes of a Reporter: the Life of
Stefan Banach, Birkhäuser, Boston, MA, 1996.

¹ 30 March 1892, Krakow, Austria-Hungary (now in Poland) – 31 August 1945,
Lvov, Ukraine, USSR (now independent).

In our definitions of linear convergence and convergence with order $q$,
we followed Definitions 2.1 and 2.2 in Chapter 4 of

    WALTER GAUTSCHI, Numerical Analysis: an Introduction, Birkhäuser,
    Boston, MA, 1997.

Exciting surveys of the history of Newton’s method are available in T.
Ypma, Historical development of the Newton–Raphson method, SIAM
Rev. 37, 531-551, 1995; H. Goldstine, History of Numerical Analysis
from the Sixteenth through the Nineteenth Century, Springer, New York,
1977; and in Chapter 6 of Jean-Luc Chabert (Editor), A History of
Algorithms from the Pebble to the Microchip, Springer, New York, 1999. As
is noted in these sources, Newton’s De analysi per aequationes numero
terminorum infinitas, probably dating from mid-1669, is sometimes
regarded as the historical source of the method, despite the fact that,
surprisingly, there is no trace in this tract of the familiar recurrence
relation $x_{k+1} = x_k - f(x_k)/f'(x_k)$ bearing Newton’s name, nor is there
a mention of the idea of derivative. Instead, the paper contains an
example of a cubic polynomial whose roots are found by purely algebraic
and rather complicated substitutions. In 1690, Joseph Raphson
(1648-1715) in the Preface to his Analysis aequationum universalis describes
his version of Newton’s method as ‘not only, I believe, not of the same
origin, but also, certainly, not with the same development’ as Newton’s
method. Further improvements to the method, and its form as we know
it today, were given by Thomas Simpson in his Essays in Mathematicks
(1740). Simpson presents it as ‘a new method for the solution of
equations’ using the ‘method of fluxions’, i.e., derivatives. It is argued in
Ypma’s article that Simpson’s contributions to this subject have been
underestimated, and ‘it would seem that the Newton–Raphson–Simpson
method is a designation more nearly representing facts of history of this
method which lurks inside millions of modern computer programs and
is printed with Newton’s name attached in so many textbooks’.

The convergence analysis of Newton’s method was initiated in the
first half of the twentieth century by L.V. Kantorovich.¹ More recently,
Smale,² Dedieu and Shub,³ and others have provided significant insight
into the properties of Newton’s method. A full discussion of the global
behaviour of the logistic equation (1.33), and other examples, will be
found in P.G. Drazin, Nonlinear Systems, Cambridge University Press,
Cambridge, 1992, particularly Chapters 1 and 3.

The secant method is also due to Newton (cf. Section 3 of Ypma’s
paper cited above), and is found in a collection of unpublished notes
termed ‘Newton’s Waste Book’ written around 1665.

In this chapter, we have been concerned with the iterative solution of
equations for a real-valued function of a single real variable. In Chapter
4, we shall discuss the iterative solution of nonlinear systems of equations
of the form $f(x) = 0$ where $f: \mathbb{R}^n \to \mathbb{R}^n$. There, corresponding to the
case of $n = 2$, we shall say more about the solution of equations of the
form $f(z) = 0$ where $f$ is a complex-valued function of a single complex
variable $z$.

¹ L.V. Kantorovich, Functional analysis and applied mathematics, Uspekhi Mat.
Nauk 3, 89-185, 1948; English transl., Rep. 1509, National Bureau of Standards,
Washington, DC, 1952.

² Steve Smale, Newton’s method estimates from data at one point, in The Merging
of Disciplines: New Directions in Pure, Applied and Computational Mathematics,
R. Ewing, K. Gross, C. Martin, Eds., Springer, New York, 185-196, 1986.

³ Jean-Pierre Dedieu and Michael Shub, Multihomogeneous Newton methods, Math.
Comput. 69 (231), 1071-1098, 2000.

This chapter has been confined to generally applicable iterative
methods for the solution of a single nonlinear equation of the form $f(x) = 0$
for a real-valued function $f$ of a single real variable. In particular, we
have not discussed specialised methods for the solution of polynomial
equations or the various techniques for locating the roots of polynomials
in the complex plane and on the real line (by Budan and Fourier,
Descartes, Hurwitz, Lobachevskii, Newton, Schur and others), although
in Chapter 5 we shall briefly touch on one such polynomial root-finding
method due to Sturm.¹ For a historical survey of the solution of
polynomial equations and a review of recent advances in this field, we refer to
the article of Victor Pan, Solving a polynomial equation: some history
and recent progress, SIAM Rev. 39, 187-220, 1997.


¹ For further details in this direction, we refer to M.A. Jenkins and J.F. Traub,
A three-stage algorithm for real polynomials using quadratic iterations, SIAM J.
Numer. Anal. 7, 545-566, 1970; A.S. Householder, The Numerical Treatment of
a Single Nonlinear Equation, McGraw-Hill, New York, 1970; and A. Ralston and
P. Rabinowitz, A First Course in Numerical Analysis, Second Edition,
McGraw-Hill, New York, 1978.


Exercises


1.1   The iteration defined by $x_{k+1} = \frac{1}{2}(x_k^2 + c)$, where $0 < c < 1$,
      has two fixed points $\xi_1$, $\xi_2$, where $0 < \xi_1 < 1 < \xi_2$. Show that

      $$x_{k+1} - \xi_1 = \tfrac{1}{2}(x_k + \xi_1)(x_k - \xi_1), \qquad k = 0, 1, 2, \ldots,$$

      and deduce that $\lim_{k\to\infty} x_k = \xi_1$ if $0 \le x_0 < \xi_2$. How does the
      iteration behave for other values of $x_0$?

1.2   Define the function $g$ by $g(0) = 0$, $g(x) = -x \sin^2(1/x)$ for
      $0 < x \le 1$. Show that $g$ is continuous, and that 0 is the only
      fixed point of $g$ in the interval $[0, 1]$. By considering the iteration
      $x_{n+1} = g(x_n)$, $n = 0, 1, 2, \ldots$, starting, first from $x_0 = 1/(k\pi)$,
      and then from $x_0 = 2/((2k + 1)\pi)$, where $k$ is an integer, show
      that according to Definition 1.3 the critical point is neither
      stable nor unstable.

1.3   Newton’s method is applied to the solution of

      $$e^x - x - 2 = 0.$$

      Show that if the starting value is positive, the iteration converges
      to the positive solution, and if the starting value is negative it
      converges to the negative solution. Obtain approximate
      expressions for $x_1$ if (i) $x_0 = 100$ and (ii) $x_0 = -100$, and describe the
      subsequent behaviour of the iteration. About how many
      iterations would be required to obtain the solution to six decimal
      digits in these two cases?

1.4   Consider the iteration

      $$x_{k+1} = x_k - \frac{[f(x_k)]^2}{f(x_k + f(x_k)) - f(x_k)}, \qquad k = 0, 1, 2, \ldots,$$

      for the solution of $f(x) = 0$. Explain the connection with
      Newton’s method, and show that $(x_k)$ converges quadratically if $x_0$
      is sufficiently close to the solution. Apply this method to the
      same example as in Example 1.7, $f(x) = e^x - x - 2$, and verify
      quadratic convergence beginning from $x_0 = 1$. Experiment with
      calculations beginning from $x_0 = 10$ and from $x_0 = -10$, and
      account for their behaviour.

1.5   It is sometimes said that Newton’s method converges
      quadratically, and therefore in the successive approximations to the
      solution the number of correct digits doubles each time. Explain
      why this is not generally correct. Suppose that $f''(x)$ is defined
      and continuous in a neighbourhood of $\xi$ and that $x_k$ agrees with
      the solution $\xi$ to $m$ decimal digits; give an estimate of the
      number of correct decimal digits in $x_{k+1}$.

      Illustrate your estimate by using Newton’s method to
      determine the positive zero of $f(x) = e^x - x - 1.000000005$, which is
      close to 0.0001; use $x_0 = 0.0005$.

1.6   Suppose that $f(\xi) = f'(\xi) = 0$, so that $f$ has a double root at $\xi$,
      and that $f''$ is defined and continuous in a neighbourhood of $\xi$.
      If $(x_k)$ is a sequence obtained by Newton’s method, show that

      $$\xi - x_{k+1} = -\frac{1}{2}\,\frac{(\xi - x_k)^2 f''(\eta_k)}{f'(x_k)} = \frac{1}{2}(\xi - x_k)\,\frac{f''(\eta_k)}{f''(\chi_k)},$$

      where $\eta_k$ and $\chi_k$ both lie between $\xi$ and $x_k$. Suppose, further,
      that $0 < m < |f''(x)| < M$ for all $x$ in the interval $[\xi - \delta, \xi + \delta]$
      for some $\delta > 0$, where $M < 2m$; show that if $x_0$ lies in this
      interval the iteration converges to $\xi$, and that convergence is
      linear, with rate $\log_{10} 2$. Verify this conclusion by finding the
      solution of $e^x = 1 + x$, beginning from $x_0 = 1$.

1.7   Extend the result of the previous exercise to a case where $f$ has
      a triple root at $\xi$, so that $f(\xi) = f'(\xi) = f''(\xi) = 0$.

1.8   Suppose that the function $f$ has a continuous second derivative,
      that $f(\xi) = 0$, and that in the interval $[X, \xi]$, with $X < \xi$,
      $f'(x) > 0$ and $f''(x) < 0$. Show that the Newton iteration,
      starting from any $x_0$ in $[X, \xi]$, converges to $\xi$.

1.9   The secant method is used to determine solutions of the equation
      $x^2 - 1 = 0$. Starting from $x_0 = 1 + \varepsilon$, $x_1 = -1 + \varepsilon$, show that
      $x_2 = \frac{1}{2}\varepsilon + O(\varepsilon^2)$, and determine $x_3$, $x_4$ and $x_5$, neglecting terms
      of order $O(\varepsilon^2)$. Explain why, at least for sufficiently small values
      of $\varepsilon$, the sequence $(x_k)$ converges to the solution $-1$.

      Repeat the calculation with $x_0$ and $x_1$ interchanged, so that
      $x_0 = -1 + \varepsilon$ and $x_1 = 1 + \varepsilon$, and show that the sequence now
      converges to the solution 1.

1.10  Write the secant iteration in the form

      $$x_{k+1} = \frac{x_k f(x_{k-1}) - x_{k-1} f(x_k)}{f(x_{k-1}) - f(x_k)}, \qquad k = 1, 2, 3, \ldots.$$

      Supposing that $f$ has a continuous second derivative in a
      neighbourhood of the solution $\xi$ of $f(x) = 0$, and that $f'(\xi) > 0$ and
      $f''(\xi) > 0$, define

      $$\varphi(x_k, x_{k-1}) = \frac{x_{k+1} - \xi}{(x_k - \xi)(x_{k-1} - \xi)},$$

      where $x_{k+1}$ has been expressed in terms of $x_k$ and $x_{k-1}$. Find
      an expression for

      $$\psi(x_{k-1}) = \lim_{x_k \to \xi} \varphi(x_k, x_{k-1}),$$

      and then determine $\lim_{x_{k-1} \to \xi} \psi(x_{k-1})$. Deduce that

      $$\lim_{x_k, x_{k-1} \to \xi} \varphi(x_k, x_{k-1}) = f''(\xi)/2f'(\xi).$$

      Now assume that

      $$\lim_{k \to \infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|^q} = A.$$

      Show that $q - 1 - 1/q = 0$, and hence that $q = \frac{1}{2}(1 + \sqrt{5})$.
      Deduce finally that

      $$\lim_{k \to \infty} \frac{|x_{k+1} - \xi|}{|x_k - \xi|^q} = \left(\frac{f''(\xi)}{2f'(\xi)}\right)^{q/(1+q)}.$$

1.11  A variant of the secant method defines two sequences $u_k$ and
      $v_k$ such that all the values $f(u_k)$, $k = 0, 1, 2, \ldots$, have one sign,
      and all the values $f(v_k)$, $k = 0, 1, 2, \ldots$, have the opposite sign.
      From the numbers $u_k$ and $v_k$ the secant formula is used to define

      $$w_k = \frac{u_k f(v_k) - v_k f(u_k)}{f(v_k) - f(u_k)}, \qquad k = 0, 1, 2, \ldots;$$

      we define $u_{k+1} = w_k$, $v_{k+1} = v_k$ if $f(w_k)$ has the same sign as
      $f(u_k)$, and otherwise $u_{k+1} = u_k$, $v_{k+1} = w_k$. Suppose that $f''$
      is defined and continuous on the interval $[u_0, v_0]$, and that, for
      some $K$, $f''$ has constant sign in $[u_K, v_K]$. Explain, graphically
      or otherwise, why either $u_k = u_K$ for all $k \ge K$, or $v_k = v_K$
      for all $k \ge K$. Deduce that the method converges linearly, and
      determine the asymptotic rate of convergence; explain clearly
      what you mean by convergence of this method. What
      advantages, if any, do you think this method has compared with the
      secant method of Definition 1.8?

1.12  A two-cycle of the iteration defined by the function $g$ is a pair
      of distinct numbers $a$, $b$ such that $b = g(a)$ and $a = g(b)$. Use
      the fact that $a$ and $b$ are fixed points of the iteration defined by
      the function $h(x) = g(g(x))$ to give a definition of stability for
      a two-cycle. Show that if $|g'(a)\,g'(b)| < 1$, then the two-cycle is
      stable, and that if $|g'(a)\,g'(b)| > 1$ the two-cycle is not stable.

      Show that if $a$, $b$ is a two-cycle for Newton’s method for the
      function $f$, and if $|f(a) f(b) f''(a) f''(b)| < [f'(a) f'(b)]^2$, then the
      two-cycle is stable.

      Show that Newton’s method for the solution of $f(x) = 0$ with

      $$f: x \mapsto x(x^2 - 1)$$

      has a two-cycle of the form $a$, $-a$, and find the value of $a$; is this
      two-cycle stable?


Solution of systems of linear equations


2.1 Introduction 
In Chapter 1 we considered the solution of a single equation of the form
$f(x) = 0$ where $f$ is a real-valued function defined and continuous on
a closed interval of the real line. The simplest example of this kind is
the linear equation $ax = b$ where $a$ and $b$ are given real numbers, with
$a \neq 0$, whose solution is

$$x = a^{-1} b, \tag{2.1}$$

trivially. Of course, we could have expressed the solution as $x = b/a$ as
in Chapter 1, but, as you will see in a moment, writing $x = a^{-1}b$ is much
more revealing in the present context. In this chapter we shall consider
a different generalisation of this elementary problem:

    Let $A$ be an $n \times n$ matrix with $a_{ij}$ as its entry in row $i$ and column $j$
    and $b$ a given column vector of size $n$ with $j$th entry $b_j$;
    find a column vector $x$ of size $n$ such that $Ax = b$.


Denoting by $x_i$ the $i$th entry of the vector $x$, we can also write $Ax = b$
in the following expanded form:

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\
&\;\;\vdots\\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n.
\end{aligned} \tag{2.2}$$

Recall that in order to ensure that for real numbers $a$ and $b$ the single
linear equation $ax = b$ has a unique solution, we need to assume that
$a \neq 0$. In the case of the simultaneous system (2.2) of $n$ linear equations
in $n$ unknowns we shall have to make an analogous assumption on the
matrix $A$.

To do so, we introduce the following definition.

Definition 2.1 The set of all $m \times n$ matrices with real entries is denoted
by $\mathbb{R}^{m \times n}$. A matrix of size $n \times n$ will be called a square matrix of order
$n$, or simply a matrix of order $n$. The determinant of a square matrix
$A \in \mathbb{R}^{n \times n}$ is the real number $\det(A)$ defined as follows:

$$\det(A) = \sum_{\mathrm{perm}} \operatorname{sign}(\nu_1, \nu_2, \ldots, \nu_n)\, a_{1\nu_1} a_{2\nu_2} \cdots a_{n\nu_n}.$$

The summation is over all $n!$ permutations $(\nu_1, \nu_2, \ldots, \nu_n)$ of the integers
$1, 2, \ldots, n$, and $\operatorname{sign}(\nu_1, \nu_2, \ldots, \nu_n) = +1$ or $-1$ depending on whether the
$n$-tuple $(\nu_1, \nu_2, \ldots, \nu_n)$ is an even or odd permutation of $(1, 2, \ldots, n)$,
respectively. An even (odd) permutation is obtained by an even (odd)
number of exchanges of two adjacent elements in the array $(1, 2, \ldots, n)$.
A matrix $A \in \mathbb{R}^{n \times n}$ is said to be nonsingular when its determinant
$\det(A)$ is nonzero.
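Definition 2.1 can be transcribed almost literally into code. The following Python sketch is an illustration of ours, with hypothetical function names: it sums over all $n!$ permutations, and obtains the sign from the parity of the number of inversions, which has the same parity as the number of exchanges of adjacent elements needed to produce the permutation. It is intended only for very small $n$.

```python
from itertools import permutations


def sign(perm):
    """+1 for an even permutation, -1 for an odd one (parity of inversions)."""
    n = len(perm)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det_by_permutations(A):
    """Determinant of a square matrix (list of rows) straight from the definition."""
    n = len(A)
    total = 0
    for perm in permutations(range(n)):
        term = sign(perm)
        for row, col in enumerate(perm):
            term *= A[row][col]        # factor a_{row, perm(row)}
        total += term
    return total


if __name__ == "__main__":
    example = [[1, 1, 1], [2, 4, 2], [-1, 5, -4]]   # an arbitrary 3 x 3 matrix
    print(det_by_permutations(example))
```

Each of the $n!$ terms costs $n - 1$ multiplications, so this transcription is useful as a check on the definition rather than as a practical method.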


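The inverse matrix $A^{-1}$ of a nonsingular matrix $A \in \mathbb{R}^{n \times n}$ is defined
as the element of $\mathbb{R}^{n \times n}$ such that $A^{-1}A = AA^{-1} = I$, where $I$ is the
$n \times n$ identity matrix

$$I = \begin{pmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}. \tag{2.3}$$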


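In order to find an explicit expression for $A^{-1}$ in terms of the elements of
the matrix $A$, we recall from linear algebra that, for each $i = 1, 2, \ldots, n$,

$$a_{i1}A_{k1} + a_{i2}A_{k2} + \cdots + a_{in}A_{kn} =
\begin{cases}
\det(A) & \text{if } i = k,\\
0 & \text{if } i \neq k,
\end{cases} \tag{2.4}$$

where $A_{ij} = (-1)^{i+j}\,\mathrm{Cof}(a_{ij})$ and $\mathrm{Cof}(a_{ij})$, called the cofactor of $a_{ij}$,
is the determinant of the $(n - 1) \times (n - 1)$ matrix obtained by erasing
from $A \in \mathbb{R}^{n \times n}$ row $i$ and column $j$. Then, it is a trivial matter to show
using (2.4) that $A^{-1}$ has the form

$$A^{-1} = \frac{1}{\det(A)}
\begin{pmatrix}
A_{11} & A_{21} & \cdots & A_{n1}\\
A_{12} & A_{22} & \cdots & A_{n2}\\
\vdots & \vdots & & \vdots\\
A_{1n} & A_{2n} & \cdots & A_{nn}
\end{pmatrix}. \tag{2.5}$$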
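Formula (2.5) can be checked on a small matrix. The Python sketch below is ours and purely illustrative: the function names are hypothetical, the determinant is evaluated by expansion in the elements of the first row, and exact rational arithmetic from the standard `fractions` module is used so that the product of the matrix and its computed inverse can be compared with the identity without rounding error.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix obtained from A by erasing row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by expansion in the elements of the first row."""
    if len(A) == 0:
        return 1                       # convention for the empty matrix
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)))


def inverse_by_cofactors(A):
    """Inverse from (2.5): entry (j, i) of the inverse is A_ij / det(A)."""
    n = len(A)
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction((-1) ** (i + j) * det(minor(A, i, j)), d)
             for i in range(n)]        # column index of the inverse
            for j in range(n)]         # row index of the inverse


if __name__ == "__main__":
    A = [[1, 1, 1], [2, 4, 2], [-1, 5, -4]]   # an arbitrary nonsingular matrix
    for row in inverse_by_cofactors(A):
        print([str(entry) for entry in row])
```

Note the transposition in (2.5): the signed determinant belonging to row $i$ and column $j$ of $A$ ends up in row $j$ and column $i$ of the inverse.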


Having found an explicit formula for the matrix $A^{-1}$, we now multiply
both sides of the equation $Ax = b$ on the left by $A^{-1}$ to deduce that
$A^{-1}(Ax) = A^{-1}b$; finally, since $A^{-1}(Ax) = (A^{-1}A)x = Ix = x$, it
follows that

$$x = A^{-1}b, \tag{2.6}$$

where the inverse $A^{-1}$ of the nonsingular matrix $A$ is given in terms of
the entries of $A$ by (2.5).¹


An alternative approach to the solution of the linear system $Ax = b$,
called Cramer’s rule, proceeds by expressing the $i$th entry of $x$ as

$$x_i = D_i/D, \qquad i = 1, 2, \ldots, n,$$

where $D = \det(A)$, and $D_i$ is the $n \times n$ determinant obtained by replacing
the $i$th column of $D$ by the entries of $b$. Evidently, we must require that
$A$ is nonsingular, i.e., that $D = \det(A) \neq 0$. Thus, all we need to do
to solve $Ax = b$ is to evaluate the $n + 1$ determinants $D, D_1, \ldots, D_n$,
each of them $n \times n$, and check that $D = \det(A)$ is nonzero; the final
calculation of the elements $x_i$, $i = 1, 2, \ldots, n$, is then trivial.²


The purpose of our next example is to illustrate the application of
Cramer’s rule.


Example 2.1 Suppose that we wish to solve the system of linear
equations

$$\begin{aligned}
x_1 + x_2 + x_3 &= 6,\\
2x_1 + 4x_2 + 2x_3 &= 16,\\
-x_1 + 5x_2 - 4x_3 &= -3.
\end{aligned}$$

The solution of such a small system can easily be found in terms of
determinants, by Cramer’s rule. This gives

$$x_1 = D_1/D, \qquad x_2 = D_2/D, \qquad x_3 = D_3/D,$$

where

$$D = \begin{vmatrix} 1 & 1 & 1\\ 2 & 4 & 2\\ -1 & 5 & -4 \end{vmatrix}, \qquad
D_1 = \begin{vmatrix} 6 & 1 & 1\\ 16 & 4 & 2\\ -3 & 5 & -4 \end{vmatrix},$$

with similar expressions for $D_2$ and $D_3$. To obtain the solution we
therefore need to evaluate four determinants. ◊


¹ By the way, on comparing (2.6) with (2.1) you will notice that (2.1) is a special
case of (2.6) when $n = 1$.

² Gabriel Cramer (31 July 1704, Geneva, Switzerland – 4 January 1752,
Bagnols-sur-Cèze, France). In the 1730s Colin Maclaurin (February 1698, Kilmodan,
Cowal, Argyllshire, Scotland – 14 June 1746, Edinburgh, Scotland) wrote his
Treatise of Algebra which was not published until 1748, two years after his death. It
contained the first published results on determinants proving Cramer’s rule for
$2 \times 2$ and $3 \times 3$ systems and indicating how the $4 \times 4$ case would work. Cramer
gave the general rule for $n \times n$ systems without proof in the Appendix to his paper
‘Introduction to the analysis of algebraic curves’ (1750), motivated by the desire
to find the equation of a plane curve passing through a number of given points.
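For a system as small as this one, the four determinants are easily evaluated by machine. The Python sketch below is ours, with hypothetical function names; it evaluates each determinant by expansion in the elements of the first row and forms the ratios $D_i/D$ in exact rational arithmetic.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix obtained from A by erasing row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by expansion in the elements of the first row."""
    if len(A) == 0:
        return 1
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)))


def cramer(A, b):
    """Solve Ax = b by Cramer's rule: x_i = D_i / D."""
    D = det(A)
    if D == 0:
        raise ValueError("matrix is singular")
    x = []
    for i in range(len(A)):
        # D_i: replace column i of A by the entries of b.
        A_i = [row[:i] + [b[k]] + row[i + 1:] for k, row in enumerate(A)]
        x.append(Fraction(det(A_i), D))
    return x


if __name__ == "__main__":
    A = [[1, 1, 1], [2, 4, 2], [-1, 5, -4]]
    b = [6, 16, -3]
    print([str(value) for value in cramer(A, b)])
```

The recursive evaluation of `det` is adequate for a $3 \times 3$ system only; its cost grows factorially with $n$.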


Now you may think that since, for $A \in \mathbb{R}^{n \times n}$ nonsingular, we have
expressed the solution to $Ax = b$ in the ‘closed form’

$$x = A^{-1}b$$

and have even found a formula for $A^{-1}$ in terms of the coefficients of $A$,
or may simply compute the entries of $x$ directly using Cramer’s rule, the
story about the simultaneous set of linear equations (2.2) has reached
its happy ending. We are sorry to disappoint you: a disturbing tale is
about to unfold.

Imagine the following example: let n = 100, say, and suppose that you 
have been given all 10000 entries of a 100 x 100 matrix A, together with 
the entries of a 100-component column vector b. To avoid trivialities, 
let us suppose that none of the entries of A or b is equal to 0. Question: 
Does the linear system Ax = b have a solution? If it does, how would 
you find, say, the 53rd entry of the solution vector x? Of course, you 
could calculate the determinant of A and check whether it is equal to 
zero; if not, you could then calculate the determinant $D_{53}$ obtained by
replacing the 53rd column of A by the vector b, and the required result, 
by Cramer’s rule, is then the ratio of these two determinants. How much 
time do you think you would need to accomplish this task? An hour? 
A day? A month? 

I imagine that you do not have a large enough sheet of paper in front
of you to write down this $100 \times 100$ matrix. Let us therefore start with
a somewhat simpler setting. Assume that $n$ is any integer, $n \ge 2$, and
denote by $d_n$ the number of arithmetic operations that are required to
calculate $\det(A)$ for $A \in \mathbb{R}^{n \times n}$. For example, for a $2 \times 2$ matrix,

$$\det(A) = a_{11}a_{22} - a_{12}a_{21};$$

this evaluation requires 3 arithmetic operations (2 multiplications and
1 subtraction), giving $d_2 = 3$. In general, we can calculate $\det(A)$ by
expanding it in the elements of its first row. This requires multiplying
each of the $n$ elements in the first row of $A$ by a subdeterminant of size
$n - 1$ (a total of $n(d_{n-1} + 1)$ operations) and summing the $n$ resulting
numbers (another $n - 1$ operations). Thus,

$$d_n = n(d_{n-1} + 1) + n - 1, \qquad n \ge 3, \qquad d_2 = 3. \tag{2.7}$$


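Let us write $d_n = c_n n!$ and substitute this into (2.7) to obtain

$$c_n = c_{n-1} + \frac{2}{(n-1)!} - \frac{1}{n!}, \qquad n \ge 3, \qquad c_2 = \frac{3}{2}. \tag{2.8}$$

Now, summing (2.8) from $n = 3$ to $k$ for $k \ge 3$ yields, on letting $0! = 1$,

$$c_k = \sum_{n=0}^{k} \frac{1}{n!} - \frac{2}{k!}.$$

As $\sum_{n=0}^{\infty} (1/n!) = e$, it follows that

$$\lim_{k \to \infty} c_k = e.$$

Thus,¹ $d_n \sim e\,n!$ as $n \to \infty$. In order to compute the solution of a
system of $n$ simultaneous linear equations by Cramer’s rule we need to
evaluate $n + 1$ determinants, each of size $n \times n$, so the total number of
operations required is about $(n + 1)d_n \sim e(n + 1)!$ as $n \to \infty$.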
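The recurrence (2.7) is easily evaluated exactly in integer arithmetic, which gives a direct check on the asymptotic statement $d_n \sim e\,n!$. The short Python sketch below is ours, with hypothetical function names. Unwinding the recurrence gives the closed form $d_k/k! = \sum_{n=0}^{k} 1/n! - 2/k!$, which the sketch also evaluates for comparison.

```python
import math
from fractions import Fraction


def det_ops(n):
    """Operation count d_n from (2.7): d_2 = 3, d_n = n (d_{n-1} + 1) + n - 1."""
    if n < 2:
        raise ValueError("the recurrence is defined for n >= 2")
    d = 3
    for m in range(3, n + 1):
        d = m * (d + 1) + m - 1
    return d


def closed_form(k):
    """c_k = d_k / k! written as a partial sum of the series for e."""
    partial = sum(Fraction(1, math.factorial(n)) for n in range(k + 1))
    return partial - Fraction(2, math.factorial(k))


if __name__ == "__main__":
    for n in (2, 3, 5, 10, 20):
        ratio = Fraction(det_ops(n), math.factorial(n))
        print(n, det_ops(n), float(ratio))
    # Total operation count for Cramer's rule with n = 100.
    print(float(101 * det_ops(100)))
```

Python integers have unlimited size, so `det_ops(100)` is computed exactly; only the final conversion to floating point rounds.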

For $n = 100$, this means approximately $101!\,e = 2.56 \times 10^{160}$ arithmetic
operations.² Today’s fastest parallel computers are capable of teraflop
speeds, i.e., $10^{12}$ floating point operations per second; therefore, the
computing time for our solution would be around $2.56 \times 10^{160}/10^{12} =
2.56 \times 10^{148}$ seconds, or a staggering $8.11 \times 10^{140}$ years. According
to the prevailing theoretical position, the Universe began in a violent
explosion, the Big Bang, about $12.5(\pm 3) \times 10^{9}$ years ago. So please put
that large sheet of paper away quickly! We need to discover a more
efficient approach.

Incidentally, you might notice that in the expansion of all the
determinants involved in Cramer’s rule all the smaller subdeterminants occur
many times over, so the number of operations involved can be reduced
by avoiding such repetitions. However, a more careful analysis shows
that we cannot by this means reduce the total by more than a factor of
about $n$, which hardly affects our conclusion.

¹ For two sequences $(a_n)$ and $(b_n)$, we shall write $a_n \sim b_n$ if $\lim_{n \to \infty}(a_n/b_n) = 1$.

² While on the subject of calculating factorials of large integers, let us mention
Stirling’s formula which states that $n! \sim \sqrt{2\pi}\, n^{n+1/2} e^{-n}$ as $n \to \infty$ (J. Stirling,
Methodus differentialis, 1730). Stirling’s approximation can be made more precise
as the double inequality

$$\sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n+1)} < n! < \sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n)}$$

(H. Robbins, A remark on Stirling’s formula, Amer. Math. Monthly 62, 26-29,
1955).

Our other approach to solving $Ax = b$, based on computing $A^{-1}$ from
(2.5) and writing $x = A^{-1}b$, is equally inefficient: in order to compute
the inverse of an $n \times n$ matrix $A$ using determinants, one has to calculate
the determinant of $A$ as well as $n^2$ determinants of size $n - 1$ each of which
then has to be divided by $\det(A)$, requiring a total of approximately

$$e\,n! + n^2 e\,(n - 1)! + n^2 \sim e(n + 1)!$$

arithmetic operations, just the same as before.

The aim of this chapter is to develop alternative methods for the
solution of the system of linear equations $Ax = b$. We begin by considering
a classical technique, Gaussian elimination. We shall then explore its
relationship to the factorisation $A = LU$ of the matrix $A$ where $L$ is
lower triangular and $U$ is upper triangular. It will be seen that by using
Gaussian elimination the number of arithmetic operations required
to solve the linear system $Ax = b$ with an $n \times n$ matrix $A$ is
approximately $\frac{2}{3}n^3$, a dramatic reduction from the $O(e(n + 1)!)$ operation
count associated with matrix inversion using determinants.²

We conclude the chapter with a discussion of another classical idea
attributed to Gauss:³ the least squares method for the solution of the
system of linear equations $Ax = b$ where $A \in \mathbb{R}^{m \times n}$, $x$ is the column
vector of unknowns of size $n$ and $b$ a given column vector of size $m$.


2.2 Gaussian elimination 


The technique for solving systems of linear algebraic equations that we
shall describe in this section was developed by Carl Friedrich Gauss and
was first published in his Theoria motus corporum coelestium in
sectionibus conicis solem ambientium (1809), a major two-volume treatise on
the motion of celestial bodies. Gauss was concerned with the study of
the asteroid Pallas, and derived a set of six linear equations with six
unknowns, also giving a systematic method for its solution.

¹ Carl Friedrich Gauss (30 April 1777, Brunswick, Duchy of Brunswick, Holy Roman
Empire (now Germany) – 23 February 1855, Göttingen, Hanover, Germany) made
outstanding contributions to mathematics, physics and astronomy. He gave the
first proof, in 1799, of the Fundamental Theorem of Algebra. Gauss worked in
differential geometry, number theory, algebra and non-Euclidean geometry.

² Note, for example, that $\frac{2}{3}\,100^3 = 0.67 \times 10^{6} < 101!\,e = 2.56 \times 10^{160}$. On a
computer that performs $10^{12}$ floating point operations a second a calculation requiring
$10^{6}$ operations via Gaussian elimination would take $10^{-6}$ seconds, as opposed to
the $8.11 \times 10^{140}$ years using Cramer’s rule or formula (2.5).

³ See, however, the bibliographical notes at the end of the chapter about the priority
dispute between Legendre and Gauss.

The method proceeds by successively eliminating the elements below 
the diagonal of the matrix of the linear system until the matrix becomes 
triangular, when the solution of the system is very easy. This technique 
is now known under the name Gaussian elimination. 

Before we embark on the general description of Gaussian elimination, 
let us illustrate its basic steps through a simple example; this is the same 
as Example 2.1 above, written out again for convenience. 


Example 2.2 Consider the system of linear equations

$$\begin{aligned}
x_1 + x_2 + x_3 &= 6,\\
2x_1 + 4x_2 + 2x_3 &= 16,\\
-x_1 + 5x_2 - 4x_3 &= -3.
\end{aligned}$$

It is convenient to rewrite this in the form $Ax = b$ where $A \in \mathbb{R}^{3 \times 3}$ and
$x$ and $b$ are column vectors of size 3; thus,

$$\begin{pmatrix} 1 & 1 & 1\\ 2 & 4 & 2\\ -1 & 5 & -4 \end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} =
\begin{pmatrix} 6\\ 16\\ -3 \end{pmatrix}. \tag{2.9}$$

We begin by adding the first row, multiplied by $-2$, to the second row,
and adding the first row to the third row, giving the new system

$$\begin{pmatrix} 1 & 1 & 1\\ \mathit{0} & 2 & 0\\ \mathit{0} & 6 & -3 \end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} =
\begin{pmatrix} 6\\ 4\\ 3 \end{pmatrix}. \tag{2.10}$$

The newly created 0 entries in the first column have been typeset in
italics. Now adding the new second row, multiplied by $-3$, to the third
row, we find

$$\begin{pmatrix} 1 & 1 & 1\\ \mathit{0} & 2 & 0\\ \mathit{0} & \mathit{0} & -3 \end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} =
\begin{pmatrix} 6\\ 4\\ -9 \end{pmatrix}, \tag{2.11}$$

which can easily be solved for the unknowns in the reverse order,
beginning with $x_3 = 3$. ◊

¹ The idea of this elimination process was already known to the Chinese two
thousand years ago. The book Jiu zhang suan shu (English translation, by K. Shen
et al.: The Nine Chapters on the Mathematical Art, Oxford University Press,
1999) contained an example of the elimination for a system of five equations with
five unknowns. This book was very influential in the history of Chinese
mathematics, and is the earliest specialised mathematical work in China that survived
to the present day. Although it is unclear when its mathematical content was
produced, it is estimated that the book was assembled during the Han dynasty in
the first century AD.

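The row operations of this example can be sketched in a few lines of Python. The code below is an illustration of ours under stated assumptions: the function name is hypothetical, exact rational arithmetic is used, and no row interchanges are attempted, so a zero diagonal element simply raises an error. It reduces the system to triangular form and then solves for the unknowns in reverse order.

```python
from fractions import Fraction


def gaussian_elimination(A, b):
    """Reduce Ax = b to upper triangular form and solve by back substitution.

    Returns (U, c, x): the triangular matrix, the transformed right-hand
    side, and the solution. No row interchanges are performed.
    """
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    c = [Fraction(v) for v in b]
    for s in range(n - 1):                 # row s is used to clear column s
        if U[s][s] == 0:
            raise ZeroDivisionError("zero diagonal element; interchanges needed")
        for r in range(s + 1, n):
            mu = -U[r][s] / U[s][s]        # multiple of row s added to row r
            U[r] = [U[r][j] + mu * U[s][j] for j in range(n)]
            c[r] += mu * c[s]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):         # unknowns in reverse order
        if U[i][i] == 0:
            raise ZeroDivisionError("matrix is singular")
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (c[i] - known) / U[i][i]
    return U, c, x


if __name__ == "__main__":
    A = [[1, 1, 1], [2, 4, 2], [-1, 5, -4]]
    b = [6, 16, -3]
    U, c, x = gaussian_elimination(A, b)
    print([[str(v) for v in row] for row in U])
    print([str(v) for v in c], [str(v) for v in x])
```

For the system of Example 2.2 the multipliers generated by the loop are $-2$, $1$ and $-3$, the same as those used in the text.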
Each of these successive row operations can be expressed as a
multiplication on the left of the matrix $A \in \mathbb{R}^{n \times n}$, $n \ge 2$ (in our example $n = 3$),
of the system of linear equations by a transformation matrix. Writing
$E^{(rs)}$ for the $n \times n$ matrix whose only nonzero element is $e_{rs} = 1$, we
see that the product

$$(I + \mu_{rs}E^{(rs)})A \tag{2.12}$$

is the same as the original matrix $A$, except that the elements of row $s$,
multiplied by a real number $\mu_{rs}$, have been added to the corresponding
elements of row $r$. Here $I$ denotes the $n \times n$ identity matrix defined by
(2.3). In the elimination process we always add a multiple of an earlier
row to a later row in the matrix, so that $1 \le s < r \le n$ in (2.12); the
transformation matrix $I + \mu_{rs}E^{(rs)}$ is therefore lower triangular in the
following sense.


Definition 2.2 Let $n$ be an integer, $n \ge 2$. The matrix $L \in \mathbb{R}^{n \times n}$ is said
to be lower triangular if $l_{ij} = 0$ for every $i$ and $j$ with $1 \le i < j \le n$.
The matrix $L \in \mathbb{R}^{n \times n}$ is called unit lower triangular if it is lower
triangular, and also the diagonal elements are all equal to unity, that is
$l_{ii} = 1$ for $i = 1, 2, \ldots, n$.


Thus the matrix $I + \mu_{rs}E^{(rs)} \in \mathbb{R}^{n \times n}$ appearing in (2.12) is unit lower
triangular if $1 \le s < r \le n$, and the above elimination process can be
expressed by multiplying $A$ on the left successively by the unit lower
triangular matrices $I + \mu_{rs}E^{(rs)}$ for $r = s + 1, \ldots, n$ and $s = 1, \ldots, n - 1$,
with $\mu_{rs} \in \mathbb{R}$; there are $\frac{1}{2}n(n - 1)$ of these matrices, one for each element
of $A$ below the diagonal (since there are $n$ elements on the diagonal and,
therefore, $1 + \cdots + (n - 1) = \frac{1}{2}(n^2 - n)$ elements below the diagonal).
The next theorem lists the technical tools which are required for proving
that the resulting product is a lower triangular matrix.


Theorem 2.1 The following statements hold for any integer $n \ge 2$:

(i) the product of two lower triangular matrices of order n is lower triangular of order n;
(ii) the product of two unit lower triangular matrices of order n is unit lower triangular of order n;
(iii) a lower triangular matrix is nonsingular if, and only if, all the diagonal elements are nonzero; in particular, a unit lower triangular matrix is nonsingular;
(iv) the inverse of a nonsingular lower triangular matrix of order n is lower triangular of order n;
(v) the inverse of a unit lower triangular matrix of order n is unit lower triangular of order n.


Proof The proofs of parts (i), (ii), (iii) and (v) are very straightforward, 
and are left as an exercise. 

Part (iv) is proved by induction; it is easily verified for a nonsingular lower triangular matrix of order 2, using (2.5). Let $n \ge 2$, suppose that (iv) is true for all nonsingular lower triangular matrices of order $k$, with $2 \le k \le n$, and let $L$ be a nonsingular lower triangular matrix of order $k + 1$. Both $L$ and its inverse $L^{-1}$ can be partitioned by their last row and column:

\[ L = \begin{pmatrix} L_1 & 0 \\ r^T & \alpha \end{pmatrix}, \qquad L^{-1} = \begin{pmatrix} X & y \\ z^T & \beta \end{pmatrix}, \]

where $L_1$ is a nonsingular lower triangular matrix of order $k$ and $X \in \mathbb{R}^{k\times k}$; $\alpha$ and $\beta$ are real numbers and $r$, $z$ and $y$ are column vectors of size $k$. Since the product $LL^{-1}$ is the identity matrix of order $k+1$, we have

\[ L_1X = I_k, \qquad L_1y = 0, \qquad r^TX + \alpha z^T = 0^T, \qquad r^Ty + \alpha\beta = 1; \]

here $I_k$ signifies the identity matrix of order $k$. Thus $X = L_1^{-1}$, which is lower triangular of order $k$ by the inductive hypothesis, and $y = 0$ given that $L_1$ is nonsingular; the remaining two equations determine $z$ and $\beta$ on noting that $\alpha \ne 0$ (given that $L$ is nonsingular). This shows that $L^{-1}$ is lower triangular of order $k + 1$, and the inductive step is complete; consequently, (iv) is true for any $n \ge 2$.


We shall also require the concept of upper triangular matrix. 


Definition 2.3 Let $n$ be an integer, $n \ge 2$. The matrix $U \in \mathbb{R}^{n\times n}$ is said to be upper triangular if $u_{ij} = 0$ for every $i$ and $j$ with $1 \le j < i \le n$.


We note that results analogous to those in the preceding theorem 
concerning lower triangular matrices are also valid for upper triangular 
matrices (replacing the words ‘lower triangular’ by ‘upper triangular’ 
throughout). 
Fig. 2.1. LU factorisation of $A \in \mathbb{R}^{n\times n}$: $A = LU$. The matrix $L \in \mathbb{R}^{n\times n}$ is unit lower triangular and $U \in \mathbb{R}^{n\times n}$ is upper triangular.


The elimination process for $A \in \mathbb{R}^{n\times n}$ may now be written as follows:

\[ L_{(N)}L_{(N-1)} \cdots L_{(1)}A = U, \qquad N = \tfrac{1}{2}n(n-1), \tag{2.13} \]

where $U \in \mathbb{R}^{n\times n}$ is an upper triangular matrix and each of the matrices $L_{(j)} \in \mathbb{R}^{n\times n}$, $j = 1, \ldots, N$, is unit lower triangular of order $n$ and has the form $I + \mu_{rs}E^{(rs)}$ with $1 \le s < r \le n$, where $I$ is the identity matrix of order $n$. That is,

\[ L_{(1)} = I + \mu_{21}E^{(21)}, \quad L_{(2)} = I + \mu_{31}E^{(31)}, \quad \ldots, \quad L_{(N)} = I + \mu_{n\,n-1}E^{(n\,n-1)}. \]

It is easy to see that $E^{(rs)}E^{(rs)} = \delta_{rs}E^{(rs)}$, where

\[ \delta_{rs} = \begin{cases} 1 & \text{for } r = s, \\ 0 & \text{for } r \ne s \end{cases} \]

is known as the Kronecker delta.¹ Thus, for $1 \le s < r \le n$, the inverse of the matrix $I + \mu_{rs}E^{(rs)}$ is the lower triangular matrix $I - \mu_{rs}E^{(rs)}$, which corresponds to the subtraction of row $s$, multiplied by $\mu_{rs}$, from row $r$. Hence

\[ A = L_{(1)}^{-1} \cdots L_{(N)}^{-1}U \equiv LU, \tag{2.14} \]

where $L$, as the product of a finite number of unit lower triangular matrices of order $n$, is itself unit lower triangular of order $n$ by Theorem 2.1(ii); see Figure 2.1.


1 Leopold Kronecker (7 December 1823, Liegnitz, Prussia, Germany (now Legnica, Poland) – 29 December 1891, Berlin, Germany) made significant contributions to the theory of elliptic functions, the theory of ideals and the algebra of quadratic forms.


2.3 LU factorisation


Having seen that the Gaussian elimination process gives rise to the factorisation $A = LU$ of the matrix $A \in \mathbb{R}^{n\times n}$, $n \ge 2$, where $L$ is unit lower triangular and $U$ is upper triangular, we shall now show how to calculate the elements of $L$ and $U$ directly. Equating the elements of $A$ and $LU$ we conclude that

\[ a_{ij} = \sum_{k=1}^{n} l_{ik}u_{kj}, \qquad 1 \le i, j \le n. \tag{2.15} \]


Recalling that $L$ and $U$ are lower and upper triangular respectively, we see that, in fact, the range of $k$ in this sum extends only up to $\min\{i, j\}$, the smaller of the numbers $i$ and $j$. Taking the two cases separately gives

\[ a_{ij} = \sum_{k=1}^{j} l_{ik}u_{kj}, \qquad 1 \le j < i \le n, \tag{2.16} \]
\[ a_{ij} = \sum_{k=1}^{i} l_{ik}u_{kj}, \qquad 1 \le i \le j \le n. \tag{2.17} \]

Rearranging these equations, and using the fact that $l_{ii} = 1$ for all $i = 1, 2, \ldots, n$, we find that

\[ l_{ij} = \frac{1}{u_{jj}}\Bigl(a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}\Bigr), \qquad i = 2, \ldots, n, \quad j = 1, \ldots, i-1, \tag{2.18} \]
\[ u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}, \qquad i = 1, \ldots, n, \quad j = i, \ldots, n, \tag{2.19} \]

with the convention that sums over empty index sets are equal to zero. Thus, the elements of $U$ in the first row are $u_{1j} = a_{1j}$, $j = 1, 2, \ldots, n$, and the elements of $L$ in the first column are $l_{11} = 1$ and $l_{i1} = a_{i1}/u_{11}$, $i = 2, \ldots, n$.

The equations (2.18) and (2.19) can now be used for the calculation of the elements $l_{ij}$ and $u_{ij}$. For each value of $i$, starting with $i = 2$, we calculate first $l_{ij}$, for $j = 1, \ldots, i-1$ in order, and then the values of $u_{ij}$, for $j = i, \ldots, n$, again in increasing order. We then move on to the same calculation for $i + 1$, and so on until $i = n$. In the calculation of $l_{ij}$ we need the values of $u_{kj}$, $1 \le k \le j \le i-1$, from previous rows, and we also need the values of $l_{ik}$, $1 \le k \le j-1$, in the same row but in previous columns; a similar argument applies to the calculation of $u_{ij}$. When carried out in this order, all the values required at each step have already been calculated.
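The row-by-row ordering described above can be sketched in Python as follows. This is a minimal illustration of (2.18) and (2.19) without pivoting, assuming the matrix is given as a list of rows and that no $u_{jj}$ vanishes; the function name `lu_factorise` and the test matrix are hypothetical choices, not part of the text.

```python
def lu_factorise(A):
    """Compute L (unit lower triangular) and U (upper triangular) with A = LU.

    Follows (2.18) and (2.19) row by row; indices are 0-based here,
    so row i of the code corresponds to row i + 1 of the text.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # (2.18): elements of L to the left of the diagonal, in increasing j
        for j in range(i):
            if U[j][j] == 0.0:
                raise ZeroDivisionError("zero diagonal element of U; no pivoting is done")
            partial = sum(L[i][k] * U[k][j] for k in range(j))
            L[i][j] = (A[i][j] - partial) / U[j][j]
        # (2.19): elements of U on and to the right of the diagonal
        for j in range(i, n):
            partial = sum(L[i][k] * U[k][j] for k in range(i))
            U[i][j] = A[i][j] - partial
    return L, U

# Hypothetical example with exactly representable entries.
A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_factorise(A)
```

For this matrix the sketch gives $L$ with rows $(1, 0, 0)$, $(2, 1, 0)$, $(4, 3, 1)$ and $U$ with rows $(2, 1, 1)$, $(0, 1, 1)$, $(0, 0, 2)$, which can be checked by multiplying out.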

Of course, we must ensure that the calculation does not fail because of division by zero; this requires that none of the $u_{jj}$, $j = 1, \ldots, n-1$, in the formula (2.18) is zero. To investigate this possibility we use the properties of certain submatrices of $A$.


Definition 2.4 Suppose that $A \in \mathbb{R}^{n\times n}$ with $n \ge 2$, and let $1 \le k \le n$. The leading principal submatrix of order $k$ of $A$ is defined as the matrix $A^{(k)} \in \mathbb{R}^{k\times k}$ whose element in row $i$ and column $j$ is equal to the element of the matrix $A$ in row $i$ and column $j$ for $1 \le i, j \le k$.


Armed with this definition, we can now formulate the main result of this section. It provides a sufficient condition for ensuring that the algorithm (2.18), (2.19) for calculating the entries of the matrices $L$ and $U$ in the LU factorisation $A = LU$ of a matrix $A \in \mathbb{R}^{n\times n}$ does not break down due to division by zero in (2.18).


Theorem 2.2 Let $n \ge 2$, and suppose that $A \in \mathbb{R}^{n\times n}$ is such that every leading principal submatrix $A^{(k)} \in \mathbb{R}^{k\times k}$ of $A$ of order $k$, with $1 \le k < n$, is nonsingular. (Note that $A$ itself is not required to be nonsingular.) Then, $A$ can be factorised in the form $A = LU$, where $L \in \mathbb{R}^{n\times n}$ is unit lower triangular and $U \in \mathbb{R}^{n\times n}$ is upper triangular.


Proof The proof is by induction on the order $n$. Let us begin by verifying the statement of the theorem for $n = 2$. We intend to show that any $2 \times 2$ matrix

\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \]

with $a \ne 0$, is equal to the product of a unit lower triangular matrix $L$ of order 2 and an upper triangular matrix $U$ of order 2; that is, we wish to establish the existence of

\[ L = \begin{pmatrix} 1 & 0 \\ m & 1 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} u & v \\ 0 & \eta \end{pmatrix} \]

such that $LU = A$, where $m$, $u$, $v$ and $\eta$ are four real numbers, to be determined. Equating the product $LU$ with $A$, we deduce that

\[ u = a, \qquad v = b, \qquad mu = c, \qquad mv + \eta = d. \]

Since $a \ne 0$ by hypothesis, the first of these equalities implies that $u \ne 0$ also; hence $m = c/u$, $v = b$, and $\eta = d - mv$. Thus we have shown the existence of the required matrices $L$ and $U$ in $\mathbb{R}^{2\times 2}$ and completed the proof for $n = 2$.
Now, suppose that the statement of the theorem has already been verified for matrices of order $k$, $2 \le k \le n$; suppose that $A \in \mathbb{R}^{(k+1)\times(k+1)}$ and all leading principal submatrices of $A$ of order $k$ and less are nonsingular. We mimic the proof in the case of $n = 2$ by partitioning $A$ into blocks by the last row and column:

\[ A = \begin{pmatrix} A^{(k)} & b \\ c^T & d \end{pmatrix}, \]

where $A^{(k)} \in \mathbb{R}^{k\times k}$ is a nonsingular matrix (all of whose leading principal submatrices are themselves nonsingular), $b$, $c$ are column vectors of size $k$, and $d$ is a real number. According to our inductive hypothesis, there exist a unit lower triangular matrix $L^{(k)}$ of order $k$ and an upper triangular matrix $U^{(k)}$ of order $k$ such that $A^{(k)} = L^{(k)}U^{(k)}$. Thus we shall seek the desired unit lower triangular matrix $L$ of order $k + 1$ and the upper triangular matrix $U$ of order $k + 1$ in the form

\[ L = \begin{pmatrix} L^{(k)} & 0 \\ m^T & 1 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} U^{(k)} & v \\ 0^T & \eta \end{pmatrix}, \]

where $m$ and $v$ are column vectors of size $k$ and $\eta$ is a real number, to be determined from the requirement that the product $LU$ be equal to the matrix $A$. On equating $LU$ with $A$, we obtain

\[ L^{(k)}U^{(k)} = A^{(k)}, \qquad L^{(k)}v = b, \qquad m^TU^{(k)} = c^T, \qquad m^Tv + \eta = d. \]

The first of these four equalities provides no new information. However, we can use the remaining three to determine the column vectors $v$ and $m$ and the real number $\eta$. Since $L^{(k)}$ is unit lower triangular, its determinant is equal to 1; therefore $L^{(k)}$ is nonsingular. This means that the second equation uniquely determines the unknown column vector $v$. Further, since by hypothesis $A^{(k)}$ is nonsingular and $A^{(k)} = L^{(k)}U^{(k)}$, we conclude that

\[ \det(A^{(k)}) = \det(L^{(k)}U^{(k)}) = \det(L^{(k)})\det(U^{(k)}) = \det(U^{(k)}); \]

given that $\det(A^{(k)}) \ne 0$ by the inductive hypothesis, this implies that $\det(U^{(k)}) \ne 0$ also, and therefore the third equation uniquely determines $m$. Having found $v$ and $m$, the fourth equation yields $\eta = d - m^Tv$. Thus we have shown the existence of the desired matrices $L$ and $U$ of order $k + 1$, and the inductive step is complete.¹


1 In the last paragraph we made use of the Binet–Cauchy Theorem which states that for three matrices $A$, $B$, $C$ in $\mathbb{R}^{k\times k}$ with $A = BC$, we have $\det(A) = \det(B)\det(C)$. This result was proved in 1812 independently by Augustin-Louis Cauchy (1789–1857) and Jacques Philippe Marie Binet (1786–1856).
2.4 Pivoting


The aim of this section is to show that even if the matrix $A$ does not satisfy the conditions of Theorem 2.2, by permuting rows and columns it can be transformed into a new matrix $\tilde{A}$ of the same size so that $\tilde{A}$ admits an LU factorisation.


Example 2.3 Consider, for example, the system obtained from (2.9) by replacing the coefficient of $x_1$ in the first equation by zero. Then, the leading element in the matrix $A$ is zero, the computation fails at the first step, and the LU factorisation of $A$ does not exist. However, if we interchange the first two equations we obtain a new matrix $\tilde{A}$ which is the same as $A$ but with the first two rows interchanged,


\[ \tilde{A} = \begin{pmatrix} 2 & 4 & 2 \\ 0 & 1 & 1 \\ -1 & 5 & -4 \end{pmatrix}. \tag{2.20} \]


Since the leading principal submatrices of order 1 and 2 of $\tilde{A}$ are nonsingular, by Theorem 2.2 the matrix $\tilde{A}$ now has the required LU factorisation, which is easily computed.


A computation which fails when an element is exactly zero is also likely to run into difficulties when that element is nonzero but of very small absolute value; the problem stems from the presence of rounding errors. The basic operation in the elimination process consists of multiplying the elements of one row of the matrix by a scalar $\mu_{rs}$, and adding to the elements of another row. The multiplication operation will always introduce a rounding error, so the elements which are multiplied by $\mu_{rs}$ will already contain a rounding error from operations with earlier rows of the matrix; these errors will therefore themselves be multiplied by $\mu_{rs}$ before adding to the new row. The errors will be magnified if $|\mu_{rs}| > 1$, and will be greatly magnified if $|\mu_{rs}| \gg 1$.

The accumulation of rounding errors alluded to in the previous paragraph can be alleviated by permuting the rows of the matrix. Thus, at each stage of the elimination process we interchange two rows, if necessary, so that the element of largest absolute value in the current column, on or below the diagonal, lies on the diagonal. This process is known as pivoting. Clearly, when pivoting is performed none of the multipliers $\mu_{rs}$ have absolute value greater than unity. The process is easily formalised by introducing permutation matrices. This leads us to our next definition.
Definition 2.5 Suppose that $n \ge 2$. A matrix $P \in \mathbb{R}^{n\times n}$ in which every element is either 0 or 1, and whose every row and every column contain exactly one nonzero element, is called a permutation matrix.


Example 2.4 Here are three of the possible $3!$ permutation matrices in $\mathbb{R}^{3\times 3}$:

\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}. \]

The proof of our next result is elementary and is left to the reader.


Lemma 2.1 Let $n \ge 2$ and suppose that $P \in \mathbb{R}^{n\times n}$ is a permutation matrix. Then, the following statements hold:

(i) given that $I$ is the identity matrix of order $n$, the matrix $P$ can be obtained from $I$ by permuting rows;

(ii) if $Q \in \mathbb{R}^{n\times n}$ is another permutation matrix, then the products $PQ$ and $QP$ are also permutation matrices;

(iii) let $P^{(rs)} \in \mathbb{R}^{n\times n}$ denote the interchange matrix, obtained from the identity matrix $I \in \mathbb{R}^{n\times n}$ by interchanging rows $r$ and $s$; any interchange matrix is a permutation matrix; moreover, any permutation matrix of order $n$ can be written as a product of interchange matrices of order $n$;

(iv) the determinant of a permutation matrix $P \in \mathbb{R}^{n\times n}$ is equal to 1 or $-1$, depending on whether $P$ is obtained from the identity matrix of order $n$ by an even or odd number of permutations of rows, respectively; in particular, a permutation matrix is nonsingular.


Now we are ready to prove the next theorem. 


Theorem 2.3 Let $n \ge 2$ and $A \in \mathbb{R}^{n\times n}$. There exist a permutation matrix $P$, a unit lower triangular matrix $L$, and an upper triangular matrix $U$, all three in $\mathbb{R}^{n\times n}$, such that

\[ PA = LU. \tag{2.21} \]


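Proof The proof is by induction on the order $n$. Let $n = 2$ and consider the matrix

\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}. \]

If $a \ne 0$, the proof follows from Theorem 2.2 with $P$ taken as the $2 \times 2$ identity matrix. If $a = 0$ but $c \ne 0$, we take

\[ P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \]

and write

\[ PA = \begin{pmatrix} c & d \\ 0 & b \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} c & d \\ 0 & b \end{pmatrix} \equiv LU. \]

If $a = 0$ and $c = 0$, the result trivially follows by writing

\[ \begin{pmatrix} 0 & b \\ 0 & d \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 0 & b \\ 0 & d \end{pmatrix} \equiv LU \]

and taking $P$ as the $2 \times 2$ identity matrix. That completes the proof for $n = 2$.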

Now, suppose that $A \in \mathbb{R}^{(k+1)\times(k+1)}$ and assume that the theorem holds for every matrix of order $k$ with $2 \le k \le n$. We begin by locating the element in the first column of $A$ which has the largest absolute value, or any one of them if there is more than one such element, and interchange rows if required; if the largest element is in row $r$ we interchange rows 1 and $r$. We then partition the new matrix according to the first row and column, writing

\[ P^{(1r)}A = \begin{pmatrix} \alpha & w^T \\ p & B \end{pmatrix} = \begin{pmatrix} 1 & 0^T \\ m & I \end{pmatrix}\begin{pmatrix} \alpha & v^T \\ 0 & C \end{pmatrix}, \tag{2.22} \]

where $\alpha$ is the element of largest absolute value in the first column, $B, C \in \mathbb{R}^{k\times k}$, and $p$, $w$, $m$ and $v$ are column vectors of size $k$, with $m$, $v$ and $C$ to be determined. Writing out the product we find that

\[ v^T = w^T, \qquad \alpha m = p, \qquad C = B - mv^T. \tag{2.23} \]

If $\alpha = 0$, then the first column of $A$ consists entirely of zeros ($p = 0$); in this case we can evidently choose $m = 0$, $v = w$ and $C = B$. Suppose now that $\alpha \ne 0$; then $m = (1/\alpha)p$, so that all the elements of $m$ have absolute value less than or equal to unity, since $\alpha$ is the element of largest absolute value in the first column. By the inductive hypothesis we can now write

\[ P^*C = L^*U^*, \tag{2.24} \]

where $P^*, L^*, U^* \in \mathbb{R}^{k\times k}$, $P^*$ is a permutation matrix, $L^*$ is unit lower triangular, and $U^*$ is upper triangular. Hence, by (2.23),

\[ P^{(1r)}A = \begin{pmatrix} 1 & 0^T \\ 0 & (P^*)^T \end{pmatrix}\begin{pmatrix} 1 & 0^T \\ P^*m & L^* \end{pmatrix}\begin{pmatrix} \alpha & v^T \\ 0 & U^* \end{pmatrix}, \tag{2.25} \]

since $(P^*)^TP^* = I$. Now, defining the permutation matrix $P$ by

\[ P = \begin{pmatrix} 1 & 0^T \\ 0 & P^* \end{pmatrix}P^{(1r)}, \tag{2.26} \]

we obtain

\[ PA = \begin{pmatrix} 1 & 0^T \\ P^*m & L^* \end{pmatrix}\begin{pmatrix} \alpha & v^T \\ 0 & U^* \end{pmatrix}, \tag{2.27} \]

which is the required factorisation of $A \in \mathbb{R}^{(k+1)\times(k+1)}$. This completes the inductive step. The theorem therefore holds for every matrix of order $n \ge 2$. □


The proof of this theorem also contains an algorithm for constructing the permutation matrix $P$, and the matrices $L$ and $U$. The permutation matrix is conveniently described by specifying the sequence of interchanges: given the $n - 1$ integers $p_1, p_2, \ldots, p_{n-1}$, the matrix $P$ is the product of the permutation matrices which interchange rows 1 and $p_1$, 2 and $p_2$, and so on.
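The constructive idea in the proof can be sketched as an elimination loop with row interchanges. The Python below is an illustrative sketch under stated assumptions rather than the book's own algorithm listing: the function name `lu_partial_pivoting` is hypothetical, the matrix is a list of rows, and the permutation is returned as a list `perm` such that row `i` of $PA$ is row `perm[i]` of $A$ (instead of the interchange sequence $p_1, \ldots, p_{n-1}$).

```python
def lu_partial_pivoting(A):
    """Return (perm, L, U) with (PA) = LU, where row i of PA is A[perm[i]]."""
    n = len(A)
    U = [row[:] for row in A]            # working copy, reduced to upper triangular form
    L = [[0.0] * n for _ in range(n)]    # multipliers; unit diagonal is set at the end
    perm = list(range(n))
    for s in range(n - 1):
        # choose the row (on or below the diagonal) with the largest |entry| in column s
        r = max(range(s, n), key=lambda i: abs(U[i][s]))
        if r != s:
            U[s], U[r] = U[r], U[s]
            L[s], L[r] = L[r], L[s]      # only columns < s are filled, so a row swap is safe
            perm[s], perm[r] = perm[r], perm[s]
        if U[s][s] == 0.0:
            continue                     # column is already zero below the diagonal
        for i in range(s + 1, n):
            m = U[i][s] / U[s][s]        # multiplier, |m| <= 1 by the choice of pivot
            L[i][s] = m
            for j in range(s, n):
                U[i][j] -= m * U[s][j]
    for i in range(n):
        L[i][i] = 1.0
    return perm, L, U

# Hypothetical example whose leading element is zero, so pivoting is essential.
A = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 0.0],
     [2.0, 1.0, 1.0]]
perm, L, U = lu_partial_pivoting(A)
```

For this example the row order is `perm == [2, 0, 1]`, and every multiplier stored in `L` has absolute value at most one, in line with the remark on pivoting earlier in the section.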


2.5 Solution of systems of equations 


Consider the linear system $Ax = b$ where $A \in \mathbb{R}^{n\times n}$ and $x$ and $b$ are column vectors of size $n$. According to Theorem 2.3 there exist a permutation matrix $P \in \mathbb{R}^{n\times n}$, a unit lower triangular matrix $L \in \mathbb{R}^{n\times n}$ and an upper triangular matrix $U \in \mathbb{R}^{n\times n}$ such that $PA = LU$. Having obtained the LU factorisation of the matrix $PA$, the solution of the system of linear equations $Ax = b$ is straightforward: multiplying both sides of $Ax = b$ on the left by the permutation matrix $P$, we obtain that

\[ PAx = Pb; \tag{2.28} \]

equivalently, $LUx = Pb$. On defining $y = Ux$ we can rewrite (2.28) as the following coupled set of linear equations:

\[ Ly = Pb, \qquad Ux = y. \tag{2.29} \]

Assuming that the matrix $P$ and the LU factorisation of $PA$ are already known, there are three stages to the calculation of $x$:
Step 1. First we apply the sequence of permutations to the vector $b$, to produce $Pb$;

Step 2. [Forward substitution] We then solve the lower triangular system $Ly = Pb$, calculating the elements in the order $y_1, y_2, \ldots, y_n$;

Step 3. [Backsubstitution] Finally the required solution $x$ is obtained from the upper triangular system $Ux = y$, calculating the elements of $x$ in the reverse order, $x_n, x_{n-1}, \ldots, x_1$.


Step 3 will break down if any of the diagonal elements of $U$ are zero, but if this happens the matrix $A$ is singular.
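The three steps can be sketched in Python as below. This is a minimal illustration, assuming the permutation is supplied as a list `perm` with row `i` of $PA$ equal to row `perm[i]` of $A$; the function name `solve_with_lu` and the sample factors (which were checked by hand for the hypothetical matrix shown in the comments) are not taken from the text.

```python
def solve_with_lu(perm, L, U, b):
    """Solve Ax = b given PA = LU, following Steps 1 to 3."""
    n = len(b)
    # Step 1: apply the permutation to b, giving Pb
    pb = [b[perm[i]] for i in range(n)]
    # Step 2: forward substitution for Ly = Pb (L has unit diagonal)
    y = [0.0] * n
    for i in range(n):
        y[i] = pb[i] - sum(L[i][j] * y[j] for j in range(i))
    # Step 3: backsubstitution for Ux = y, in the order x_n, ..., x_1
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if U[i][i] == 0.0:
            raise ZeroDivisionError("zero diagonal element of U: A is singular")
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# Hand-checked factors of the hypothetical matrix
#   A = [[0, 2, 1], [1, 1, 0], [2, 1, 1]]  with row order perm = [2, 0, 1].
perm = [2, 0, 1]
L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.25, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, -0.75]]
b = [7.0, 3.0, 7.0]          # equals A times (1, 2, 3)
x = solve_with_lu(perm, L, U, b)
```

Because the factors are reused unchanged, further right-hand sides only require another call to `solve_with_lu`.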

The next section is devoted to assessing the amount of computational 
work for this algorithm. 


2.6 Computational work 


In this section we shall show that the work involved in factorising an $n \times n$ matrix in the form $A = LU$ is proportional to $n^3$. An estimate of
the amount of computational work of this kind is important in deciding 
in advance how long a calculation would take for a very large matrix, and 
is also useful in comparing different methods for the solution of a given 
problem. For example, in the next chapter we shall derive a method 
for solving a system of equations with a symmetric positive definite 
matrix; that method requires only half the amount of work involved 
in the standard LU factorisation algorithm which takes no account of 
symmetry. 

Accurate estimates of the time taken by a computation are very com- 
plicated and require some detailed knowledge of the computer being 
used. The estimates which we shall give are simple but crude; they 
are normally good enough for the types of comparisons we have just 
mentioned. 

We see from (2.18) that the calculation of $l_{ij}$ requires $j - 1$ multiplications, $j - 2$ additions, 1 subtraction and 1 division, a total of $2j - 1$ operations. In the same way, (2.19) shows that the calculation of $u_{ij}$ requires $2i - 2$ operations.¹ Recalling that, for any integer $k \ge 2$,

\[ 1 + \cdots + k = \tfrac{1}{2}k(k+1) \quad\text{and}\quad 1^2 + \cdots + k^2 = \tfrac{1}{6}k(k+1)(2k+1), \]

we then deduce that the total number of operations involved in the LU factorisation is

\[ \sum_{i=2}^{n}\Bigl(\sum_{j=1}^{i-1}(2j-1) + \sum_{j=i}^{n}(2i-2)\Bigr) = \tfrac{1}{6}n(n-1)(4n+1). \]

It is enough to say that the number of operations required is about $\frac{2}{3}n^3$, for moderately large values of $n$.

1 We do not count the row interchanges in the number of 'operations'.

Having constructed the factorisation we can now count the number of operations required to compute the vectors $y$ and $x$ in (2.29). Given the vector $Pb$, the elements of $y$ are obtained from

\[ y_i = (Pb)_i - \sum_{j=1}^{i-1} l_{ij}y_j, \qquad i = 1, 2, \ldots, n, \tag{2.30} \]

which requires $2i - 2$ operations. Summing over $i$ this gives a total of $n(n-1)$. The calculation of the elements of $x$ is similar:

\[ x_i = \frac{1}{u_{ii}}\Bigl(y_i - \sum_{j=i+1}^{n} u_{ij}x_j\Bigr), \qquad i = n, n-1, \ldots, 1. \tag{2.31} \]

This requires $2(n - i) + 1$ operations, giving a total of $n^2$.
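The closed-form totals can be checked by brute force. The short Python sketch below simply adds up the per-element counts stated above ($2j - 1$ for each $l_{ij}$, $2i - 2$ for each $u_{ij}$ and each $y_i$, $2(n-i) + 1$ for each $x_i$); the function names are hypothetical and the code counts operations only, without performing any factorisation.

```python
def lu_operation_count(n):
    """Sum the per-element operation counts of (2.18) and (2.19)."""
    ops = 0
    for i in range(1, n + 1):
        for j in range(1, i):          # l_ij, j = 1, ..., i - 1
            ops += 2 * j - 1
        for j in range(i, n + 1):      # u_ij, j = i, ..., n
            ops += 2 * i - 2
    return ops

def triangular_solve_counts(n):
    """Operation counts for forward substitution and backsubstitution."""
    forward = sum(2 * i - 2 for i in range(1, n + 1))
    backward = sum(2 * (n - i) + 1 for i in range(1, n + 1))
    return forward, backward

for n in (2, 5, 10, 50):
    # the ratio to (2/3) n^3 approaches 1 as n grows
    print(n, lu_operation_count(n), lu_operation_count(n) / ((2.0 / 3.0) * n ** 3))
```

The loop at the end prints the exact count next to its ratio to $\frac{2}{3}n^3$, which gives a feel for how quickly the leading term dominates.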

The total number of operations involved in the solution of the system of equations is therefore approximately $\frac{2}{3}n^3 - \frac{1}{2}n^2$ for the factorisation, followed by $n(n-1) + n^2 = 2n^2 - n$ for the solution of the two triangular systems, that is, approximately $\frac{2}{3}n^3 + \frac{3}{2}n^2$, ignoring terms of size $O(n)$. We often need to solve a number of systems of this kind, all with different right-hand sides, but with the same matrix. We then need only factorise the matrix once, and the total number of operations required for $k$ right-hand sides becomes approximately $\frac{2}{3}n^3 + (2k - \frac{1}{2})n^2$. When $k$ is fairly large it might appear that it would be more efficient to form the inverse matrix $A^{-1}$, and then multiply each right-hand side by the inverse; but we shall show that it is not so.

To form the inverse matrix we first factorise the matrix $A$, and then solve $n$ systems, with the right-hand sides being the vectors which constitute the columns of the identity matrix. Because these right-hand sides have a special form, there is the possibility of saving some work; some careful counting shows that the total can be reduced from $\frac{2}{3}n^3 + 2n^3 = \frac{8}{3}n^3$ to an approximate total of $2n^3$ operations. It is easy to see that the operation of multiplying a vector by the inverse matrix requires $n(2n - 1)$ operations; hence the whole computation of first constructing the inverse matrix, and then multiplying each right-hand side by the inverse, requires a total of $2n^3 + 2kn^2$ operations (ignoring terms of size $O(n)$). This is always greater than the previous value $\frac{2}{3}n^3 + (2k - \frac{1}{2})n^2$, whether $k$ is small or large. The most efficient way of solving this problem is to construct and save the $L$ and $U$ factors of $A$, rather than to form the inverse of $A$.
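The comparison can be made explicit by subtracting the two approximate totals quoted above (a short check, with terms of size $O(n)$ ignored in both):

\[ \bigl(2n^3 + 2kn^2\bigr) - \bigl(\tfrac{2}{3}n^3 + (2k - \tfrac{1}{2})n^2\bigr) = \tfrac{4}{3}n^3 + \tfrac{1}{2}n^2 > 0 \qquad\text{for all } n \ge 1. \]

The difference does not involve $k$ at all: under these estimates the extra cost of forming the inverse is never recovered, however many right-hand sides are solved.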


2.7 Norms and condition numbers 


The analysis of the effects of rounding error on solutions of systems of linear equations requires an appropriate measure. This is provided by the concept of norm defined below. In order to motivate the axioms of norm stated in Definition 2.6, we note that the set $\mathbb{R}$ of real numbers is a linear space, and that the absolute value function

\[ v \in \mathbb{R} \mapsto |v| = \begin{cases} v & \text{if } v \ge 0, \\ -v & \text{if } v < 0, \end{cases} \]

has the following properties:

• $|v| \ge 0$ for any $v \in \mathbb{R}$, and $|v| = 0$ if, and only if, $v = 0$;
• $|\lambda v| = |\lambda|\,|v|$ for all $\lambda \in \mathbb{R}$ and all $v \in \mathbb{R}$;
• $|u + v| \le |u| + |v|$ for all $u$ and $v$ in $\mathbb{R}$.

The absolute value $|v|$ of a real number $v$ measures the distance between $v$ and 0 (the zero element of the linear space $\mathbb{R}$). Our next definition aims to generalise this idea to an arbitrary linear space $V$ over the field $\mathbb{R}$ of real numbers: even though the discussion in the present chapter is confined to finite-dimensional linear spaces of vectors ($V = \mathbb{R}^n$) and square matrices ($V = \mathbb{R}^{n\times n}$), norms over other linear spaces, including infinite-dimensional function spaces, will appear elsewhere in the text (see Chapters 8, 9, 11 and 14).


Definition 2.6 Suppose that $V$ is a linear space over the field $\mathbb{R}$ of real numbers. The nonnegative real-valued function $\|\cdot\|$ is said to be a norm on the space $V$ provided that it satisfies the following axioms:

• $\|v\| = 0$ if, and only if, $v = 0$ in $V$;
• $\|\lambda v\| = |\lambda|\,\|v\|$ for all $\lambda \in \mathbb{R}$ and all $v$ in $V$;
• $\|u + v\| \le \|u\| + \|v\|$ for all $u$ and $v$ in $V$ (the triangle inequality).

A linear space $V$, equipped with a norm, is called a normed linear space.


Remark 2.1 If $V$ is a linear space over the field $\mathbb{C}$ of complex numbers, then $\mathbb{R}$ in the second axiom of Definition 2.6 should be replaced by $\mathbb{C}$, with $|\lambda|$ signifying the modulus of $\lambda \in \mathbb{C}$.
Any norm on the linear space $V = \mathbb{R}^n$ will be called a vector norm. Three vector norms are in common use in numerical linear algebra: the 1-norm $\|\cdot\|_1$, the 2-norm (or Euclidean norm) $\|\cdot\|_2$, and the $\infty$-norm $\|\cdot\|_\infty$; these are defined below.


Definition 2.7 The 1-norm of the vector $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$ is defined by

\[ \|v\|_1 = \sum_{i=1}^{n} |v_i|. \tag{2.32} \]


Definition 2.8 The 2-norm of the vector $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$ is defined by $\|v\|_2 = (v^Tv)^{1/2}$. In other words,

\[ \|v\|_2 = \Bigl(\sum_{i=1}^{n} |v_i|^2\Bigr)^{1/2}. \tag{2.33} \]


Definition 2.9 The $\infty$-norm of the vector $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$ is defined by

\[ \|v\|_\infty = \max_{i=1}^{n} |v_i|. \tag{2.34} \]


When $n = 1$, each of these norms collapses to the absolute value, $|\cdot|$, the simplest example of a norm on $V = \mathbb{R}$.
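The three definitions translate directly into code. The following Python sketch uses only the standard library; the function names `norm_1`, `norm_2` and `norm_inf` and the sample vector are illustrative assumptions.

```python
import math

def norm_1(v):
    """1-norm (2.32): sum of absolute values."""
    return sum(abs(x) for x in v)

def norm_2(v):
    """2-norm (2.33): square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

def norm_inf(v):
    """Infinity-norm (2.34): largest absolute value."""
    return max(abs(x) for x in v)

v = [3.0, -4.0, 0.0]
norms = (norm_1(v), norm_2(v), norm_inf(v))   # (7.0, 5.0, 4.0)
```

For a one-component vector such as `[-2.5]` all three functions return the absolute value, 2.5, as noted above.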

It is easy to show that $\|\cdot\|_1$ and $\|\cdot\|_\infty$ obey all axioms of a norm. For the 2-norm the first two axioms are still trivial to verify; to show that the triangle inequality is satisfied by the 2-norm requires use of the Cauchy¹–Schwarz² inequality.


Lemma 2.2 (Cauchy–Schwarz inequality)

\[ \Bigl|\sum_{i=1}^{n} u_iv_i\Bigr| \le \|u\|_2\,\|v\|_2 \qquad \forall\, u, v \in \mathbb{R}^n. \tag{2.35} \]


1 Augustin-Louis Cauchy (21 August 1789, Paris, France — 23 May 1857, Sceaux 
(near Paris), France) made very significant contributions to algebra and number 
theory. He was one of the founders of modern mathematical analysis, the theory 
of complex functions, and the mathematics of elasticity theory. 

2 Karl Hermann Amandus Schwarz (25 January 1843, Hermsdorf, Silesia, Germany (now in Poland) – 30 November 1921, Berlin, Germany) succeeded Karl Weierstrass as Professor of Mathematics at Berlin in 1892. Outside mathematics he acted as captain of the local Voluntary Fire Brigade, and helped the stationmaster at the local railway station by closing the doors of the trains.
Proof The proof of this inequality is rather simple: for any $u$ and $v$ in $\mathbb{R}^n$, and all $\lambda \in \mathbb{R}$,

\[ 0 \le \|\lambda u + v\|_2^2 = \sum_{i=1}^{n} (\lambda u_i + v_i)^2 = \lambda^2\sum_{i=1}^{n} u_i^2 + 2\lambda\sum_{i=1}^{n} u_iv_i + \sum_{i=1}^{n} v_i^2. \tag{2.36} \]

Hence, the expression on the right is a nonnegative quadratic polynomial in $\lambda \in \mathbb{R}$, of the form $A\lambda^2 + B\lambda + C$; therefore, the associated discriminant,

\[ B^2 - 4AC = \Bigl(2\sum_{i=1}^{n} u_iv_i\Bigr)^2 - 4\Bigl(\sum_{i=1}^{n} u_i^2\Bigr)\Bigl(\sum_{i=1}^{n} v_i^2\Bigr), \]

is nonpositive. This implies (2.35) on recalling Definition 2.8. □


The triangle inequality for the 2-norm is now deduced as follows: letting $\lambda = 1$ in (2.36) and using (2.35), it follows that

\[ \|u + v\|_2^2 = \|u\|_2^2 + 2\sum_{i=1}^{n} u_iv_i + \|v\|_2^2 \le \|u\|_2^2 + 2\|u\|_2\|v\|_2 + \|v\|_2^2 = \bigl(\|u\|_2 + \|v\|_2\bigr)^2, \]

which yields the triangle inequality in the 2-norm on taking square roots. Hence $\|\cdot\|_2$ satisfies all three axioms of norm.

The 1-norm and the 2-norm on $\mathbb{R}^n$ are special cases of the $p$-norm, defined on $\mathbb{R}^n$, for $p \ge 1$, by

\[ \|v\|_p = \Bigl(\sum_{i=1}^{n} |v_i|^p\Bigr)^{1/p}. \tag{2.37} \]

The first two axioms of norm are trivial to verify for $\|\cdot\|_p$; however, showing the triangle inequality is less straightforward (except for $p = 1$, and for $p = 2$, as we have already seen); we shall now sketch the proof of this for $p > 1$. The starting point is the following result, known as Young's inequality.¹

1 William Henry Young (20 October 1863, London, England – 7 July 1942, Lausanne, Switzerland) studied mathematics at Peterhouse, Cambridge. His most important contributions were to the calculus of functions of several variables. Young was elected Fellow of the Royal Society in 1907; he was president of the London Mathematical Society (1922–1924) and president of the International Union of Mathematicians (1929–1936).
Theorem 2.4 (Young's inequality) Let $p, q > 1$, $(1/p) + (1/q) = 1$. Then, for any two nonnegative real numbers $a$ and $b$,

\[ ab \le \frac{a^p}{p} + \frac{b^q}{q}. \]


Proof If either $a = 0$ or $b = 0$ the inequality holds trivially. Let us therefore suppose that $a > 0$ and $b > 0$. We recall that a function $x \in \mathbb{R} \mapsto f(x) \in \mathbb{R}$ is said to be convex if

\[ f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta)f(y) \]

for all $\theta \in [0, 1]$, and all $x$ and $y$ in $\mathbb{R}$; i.e., for any $x$ and $y$ in $\mathbb{R}$ the graph of the function $f$ between the points $(x, f(x))$ and $(y, f(y))$ lies on or below the chord that connects these two points. Note that the function $x \mapsto e^x$ is convex. Therefore, with $\theta = 1/p$ and $1 - \theta = 1/q$, we get that

\[ ab = e^{\ln a + \ln b} = e^{(1/p)\ln a^p + (1/q)\ln b^q} \le \frac{1}{p}e^{\ln a^p} + \frac{1}{q}e^{\ln b^q} = \frac{a^p}{p} + \frac{b^q}{q}, \]

and the proof is complete. (When $p = q = 2$ the proof is trivial: as $(a - b)^2 \ge 0$ also $2ab \le a^2 + b^2$, and hence the required result.) □


The next step is to establish Hölder's inequality;¹ it is a generalisation of the Cauchy–Schwarz inequality.


Theorem 2.5 (Hölder's inequality) Let $p, q > 1$, $(1/p) + (1/q) = 1$. Then, for any $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$, we have

\[ \Bigl|\sum_{i=1}^{n} u_iv_i\Bigr| \le \|u\|_p\,\|v\|_q. \]


Proof If either $u = 0$ or $v = 0$ the inequality holds trivially. Let us therefore suppose that $u \ne 0$ and $v \ne 0$, and consider the vectors $\tilde{u}$ and $\tilde{v}$ in $\mathbb{R}^n$ with components $\tilde{u}_i = u_i/\|u\|_p$ and $\tilde{v}_i = v_i/\|v\|_q$, respectively, $i = 1, 2, \ldots, n$. By Young's inequality,

\[ \Bigl|\sum_{i=1}^{n} \tilde{u}_i\tilde{v}_i\Bigr| \le \sum_{i=1}^{n} |\tilde{u}_i|\,|\tilde{v}_i| \le \frac{1}{p}\sum_{i=1}^{n} |\tilde{u}_i|^p + \frac{1}{q}\sum_{i=1}^{n} |\tilde{v}_i|^q = \frac{1}{p} + \frac{1}{q} = 1. \]

Inserting the defining expressions for $\tilde{u}_i$ and $\tilde{v}_i$ into the left-most expression in this chain, the result follows. □


1 Otto Ludwig Hölder (22 December 1859, Stuttgart, Germany – 29 August 1937, Leipzig, Germany) contributed to group theory; we owe him the concepts of factor group, and inner and outer automorphisms. Hölder discovered the inequality now named after him in 1884 while working on the convergence of Fourier series.
The triangle inequality in the $p$-norm is referred to as Minkowski's inequality.¹


Theorem 2.6 (Minkowski's inequality) Let $1 \le p \le \infty$ and $u, v \in \mathbb{R}^n$. Then,

\[ \|u + v\|_p \le \|u\|_p + \|v\|_p. \]


Proof As we noted earlier, the proof of this inequality for $p = 1$ and $p = \infty$ is easy. Let us therefore focus on the case $1 < p < \infty$. In the nontrivial case of $u \ne 0$ and $v \ne 0$, Hölder's inequality yields

\[ \|u + v\|_p^p = \sum_{i=1}^{n} |u_i + v_i|^p \le \sum_{i=1}^{n} |u_i + v_i|^{p-1}\bigl(|u_i| + |v_i|\bigr) \]
\[ \le \Bigl(\sum_{i=1}^{n} |u_i + v_i|^p\Bigr)^{\frac{p-1}{p}}\Bigl(\Bigl(\sum_{i=1}^{n} |u_i|^p\Bigr)^{\frac{1}{p}} + \Bigl(\sum_{i=1}^{n} |v_i|^p\Bigr)^{\frac{1}{p}}\Bigr) = \|u + v\|_p^{p-1}\bigl(\|u\|_p + \|v\|_p\bigr), \]

and hence the desired result on dividing through by $\|u + v\|_p^{p-1}$ (the inequality being trivial when $u + v = 0$).


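Remark 2.2 For a nonzero element $u$ in $\mathbb{R}^n$, let $\tilde{u} = (\|u\|_\infty)^{-1}u$. Clearly, $1 \le \|\tilde{u}\|_p \le n^{1/p}$, and hence $\lim_{p\to\infty} \|\tilde{u}\|_p = 1$. Therefore,

\[ \|u\|_\infty = \lim_{p\to\infty} \|u\|_p, \qquad u \in \mathbb{R}^n. \]

This identity justifies our use of the notation $\|\cdot\|_\infty$ for the maximum norm, defined by $\|u\|_\infty = \max_{i=1}^{n} |u_i|$, and our terminology: $\infty$-norm.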
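The limit in Remark 2.2 can be observed numerically. The sketch below evaluates the $p$-norm of (2.37) after scaling by the largest component, which mirrors the vector $\tilde{u}$ of the remark and also avoids overflow for large $p$; the function name `norm_p` and the sample vector are assumptions made for illustration.

```python
def norm_p(v, p):
    """p-norm (2.37) for p >= 1, computed as ||v||_inf * ||v / ||v||_inf||_p."""
    scale = max(abs(x) for x in v)          # the infinity-norm of v
    if scale == 0.0:
        return 0.0
    # every scaled component has modulus <= 1, so the powers cannot overflow
    return scale * sum((abs(x) / scale) ** p for x in v) ** (1.0 / p)

u = [3.0, -4.0, 1.0]                        # ||u||_inf = 4
for p in (1, 2, 4, 8, 16, 64, 256):
    print(p, norm_p(u, p))                  # values decrease towards 4
```

Each printed value lies between $\|u\|_\infty$ and $n^{1/p}\|u\|_\infty$, the two bounds used in the remark.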


Remark 2.3 We note here that $\|\cdot\|_p$, $1 \le p \le \infty$, is also a norm on the linear space $\mathbb{C}^n$ of $n$-component vectors with complex entries, over the field $\mathbb{C}$ of complex numbers, provided that $|v_i|$ in the definition (2.37) of $\|\cdot\|_p$ is interpreted as the modulus of the complex number $v_i$.


In order to highlight the difference between $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$, in Figure 2.2 we plot the 'unit spheres' (or 'unit circles', in the case of $n = 2$) corresponding to these three norms on $V = \mathbb{R}^2$. We recall that the unit sphere in a normed linear space $V$, with norm $\|\cdot\|$, is defined as the set $\{v \in V\colon \|v\| = 1\}$. It can be seen from Figure 2.2 that

\[ \{v \in \mathbb{R}^2\colon \|v\|_1 \le 1\} \subset \{v \in \mathbb{R}^2\colon \|v\|_2 \le 1\} \subset \{v \in \mathbb{R}^2\colon \|v\|_\infty \le 1\}. \]

We leave it to the reader as an exercise to show that analogous inclusions hold in $\mathbb{R}^n$ for any $n \ge 1$. (See Exercise 8.)


Fig. 2.2. 'Unit circles' in the linear space $V = \mathbb{R}^2$ with respect to three vector norms: (a) the 1-norm; (b) the 2-norm; (c) the $\infty$-norm.


1 Hermann Minkowski (22 June 1864, Alexotas, Russia (now Kaunas, Lithuania) – 12 January 1909, Göttingen, Germany) held a chair at the University of Göttingen, where he was exposed to Hilbert's work on mathematical physics. Minkowski realised that the ideas of Lorentz and Einstein can be best understood in terms of non-Euclidean geometry, with space and time coupled into a four-dimensional continuum. He died at the age of 44 from a ruptured appendix.


The unit sphere in a normed linear space $V$ with norm $\|\cdot\|$ is the
boundary of the closed unit ball $\bar{B}_1(0)$ centred at 0 defined by

\[ \bar{B}_1(0) = \{v \in V : \|v\| \le 1\}. \]

Analogously, the open unit ball centred at 0 is defined by

\[ B_1(0) = \{v \in V : \|v\| < 1\}. \]

More generally, for $\varepsilon > 0$ and $\xi \in V$,

\[ \bar{B}_\varepsilon(\xi) = \{v \in V : \|v - \xi\| \le \varepsilon\} \]

is the closed ball of radius $\varepsilon$ centred at $\xi$; analogously,

\[ B_\varepsilon(\xi) = \{v \in V : \|v - \xi\| < \varepsilon\} \]

is the open ball of radius $\varepsilon$ centred at $\xi$.
Any norm on the linear space $\mathbb{R}^{n \times n}$ of $n \times n$ matrices with real entries
will be referred to as a matrix norm. In particular, we shall now
consider matrix norms which are induced by vector norms in a sense
that will be made precise in the next definition.


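Definition 2.10 Given any norm $\|\cdot\|$ on the space $\mathbb{R}^n$ of n-dimensional
vectors with real entries, the subordinate matrix norm on the space
$\mathbb{R}^{n \times n}$ of $n \times n$ matrices with real entries is defined by

\[ \|A\| = \max_{v \in \mathbb{R}^n_*} \frac{\|Av\|}{\|v\|}. \tag{2.38} \]

In (2.38) we used $\mathbb{R}^n_*$ to denote $\mathbb{R}^n \setminus \{0\}$, where, for sets $A$ and $B$,
$A \setminus B = \{x \in A : x \notin B\}$.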


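Remark 2.4 Let $\mathbb{C}^{n \times n}$ denote the linear space of $n \times n$ matrices with
complex entries over the field $\mathbb{C}$ of complex numbers. Given any norm
$\|\cdot\|$ on the linear space $\mathbb{C}^n$, the subordinate matrix norm on $\mathbb{C}^{n \times n}$
is defined by

\[ \|A\| = \max_{v \in \mathbb{C}^n_*} \frac{\|Av\|}{\|v\|}, \]

where $\mathbb{C}^n_* = \mathbb{C}^n \setminus \{0\}$.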


It is easy to show that a subordinate matrix norm satisfies the axioms
of norm listed in Definition 2.6; the details are left as an exercise.
Definition 2.10 implies that, for $A \in \mathbb{R}^{n \times n}$,

\[ \|Av\| \le \|A\|\,\|v\| \qquad \text{for all } v \in \mathbb{R}^n. \]

In a relation like this any vector norm may be used, but of course it is
necessary to use the same norm throughout. It follows from Definition
2.10 that, in any subordinate matrix norm $\|\cdot\|$ on $\mathbb{R}^{n \times n}$,

\[ \|I\| = 1, \]

where $I$ is the $n \times n$ identity matrix.

Given any vector $v$ in $\mathbb{R}^n$, it is a trivial matter to evaluate each of the
three norms $\|v\|_1$, $\|v\|_2$, $\|v\|_\infty$; however, it is not yet obvious how one
can calculate the corresponding subordinate matrix norm of a given
matrix $A$ in $\mathbb{R}^{n \times n}$. Definition 2.10 is unhelpful in this respect: calculating
$\|A\|$ via (2.38) would involve the unpleasant task of maximising the
function $v \mapsto \|Av\|/\|v\|$ over $\mathbb{R}^n_*$ (or, equivalently, maximising $w \mapsto \|Aw\|$
over the unit sphere $\{w \in \mathbb{R}^n : \|w\| = 1\}$). This difficulty is resolved by
the following three theorems.

Theorem 2.7 The matrix norm subordinate to the vector norm $\|\cdot\|_\infty$
can be expressed, for an $n \times n$ matrix $A = (a_{ij})_{1 \le i,j \le n} \in \mathbb{R}^{n \times n}$, as

\[ \|A\|_\infty = \max_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|. \tag{2.39} \]

This result is often loosely expressed by saying that the $\infty$-norm of a
matrix is its largest row-sum.


Proof Given an arbitrary vector $v$ in $\mathbb{R}^n_*$, write $K = \|v\|_\infty$, so that
$|v_j| \le K$ for $j = 1, 2, \ldots, n$. Then,

\[ |(Av)_i| = \Bigl| \sum_{j=1}^{n} a_{ij} v_j \Bigr| \le \sum_{j=1}^{n} |a_{ij}|\,|v_j| \le K \sum_{j=1}^{n} |a_{ij}|, \qquad i = 1, 2, \ldots, n. \]

Now we define

\[ C = \max_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}| \tag{2.40} \]

and note that

\[ \frac{\|Av\|_\infty}{\|v\|_\infty} = \frac{\max_{i=1}^{n} |(Av)_i|}{K} \le C \qquad \text{for all } v \in \mathbb{R}^n_*. \]

Hence, $\|A\|_\infty \le C$.

Next we show that $\|A\|_\infty \ge C$. To do so, we take $v$ to be a vector in
$\mathbb{R}^n$ each of whose entries is $\pm 1$, with the choice of sign to be made clear
below. In the definition of $C$, equation (2.40), let $m$ be the value of $i$
for which the maximum is attained, or any one of the values if there is
more than one. Then, in the vector $v$ we give the element $v_j$ the same
sign as that of $a_{mj}$; if $a_{mj}$ happens to be zero, the choice of the sign of
$v_j$ is irrelevant. With this definition of $v$ we see at once that

\[ \|Av\|_\infty = \max_{i=1}^{n} \Bigl| \sum_{j=1}^{n} a_{ij} v_j \Bigr| \ge \Bigl| \sum_{j=1}^{n} a_{mj} v_j \Bigr| = \sum_{j=1}^{n} |a_{mj}| = C. \]

As $\|v\|_\infty = 1$, it follows that

\[ \|Av\|_\infty \ge C \|v\|_\infty, \]
which means that $\|A\|_\infty \ge C$. Hence $\|A\|_\infty = C$, as required.

Theorem 2.8 The matrix norm subordinate to the vector norm $\|\cdot\|_1$
can be expressed, for an $n \times n$ matrix $A = (a_{ij})_{1 \le i,j \le n} \in \mathbb{R}^{n \times n}$, as

\[ \|A\|_1 = \max_{j=1}^{n} \sum_{i=1}^{n} |a_{ij}|. \]

This is often loosely expressed by saying that the 1-norm of a matrix is
its largest column-sum. The proof of this theorem is very similar to that
of the previous one, and is left as an exercise (see Exercise 7). Note that
Theorems 2.7 and 2.8 mean that the 1-norm of a matrix $A = (a_{ij})_{1 \le i,j \le n}$
is the $\infty$-norm of the transpose $A^T = (a_{ji})_{1 \le i,j \le n}$ of the matrix.
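The row-sum and column-sum characterisations translate directly into code. The sketch below stores a matrix as a list of rows; the function names and the sample matrix are illustrative assumptions, not part of the text.

```python
def matrix_norm_inf(A):
    """Largest absolute row-sum, as in Theorem 2.7."""
    return max(sum(abs(a) for a in row) for row in A)

def matrix_norm_1(A):
    """Largest absolute column-sum, as in Theorem 2.8."""
    return max(sum(abs(a) for a in column) for column in zip(*A))

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*A)]

A = [[1.0, -2.0],
     [3.0, 4.0]]

print(matrix_norm_inf(A))             # row sums are 3 and 7
print(matrix_norm_1(A))               # column sums are 4 and 6
print(matrix_norm_inf(transpose(A)))  # equals the 1-norm of A
```

The last line illustrates the closing remark above: the 1-norm of a matrix is the $\infty$-norm of its transpose.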

Before we state a characterisation of the subordinate matrix 2-norm,
we recall the following definition from linear algebra.

Definition 2.11 Suppose that $A \in \mathbb{R}^{n \times n}$. A complex number $\lambda$, for
which the set of linear equations

\[ Ax = \lambda x \]

has a nontrivial solution $x \in \mathbb{C}^n_* = \mathbb{C}^n \setminus \{0\}$, is called an eigenvalue
of $A$; the associated solution $x \in \mathbb{C}^n_*$ is called an eigenvector of $A$
(corresponding to $\lambda$).

Now we are ready to state our result.

Theorem 2.9 Let $A \in \mathbb{R}^{n \times n}$ and denote the eigenvalues of the matrix
$B = A^T A$ by $\lambda_i$, $i = 1, 2, \ldots, n$. Then,

\[ \|A\|_2 = \max_{i=1}^{n} \lambda_i^{1/2}. \]


Proof Note first that the matrix $B$ is symmetric, i.e., $B = B^T$; therefore
all of its eigenvalues are real and the associated eigenvectors belong to
$\mathbb{R}^n_*$. (You may wish to prove this: consult the proof of Theorem 3.1, part
(ii), for a hint.) Moreover, all eigenvalues of $B$ are nonnegative, since if
$v \in \mathbb{R}^n_*$ is an eigenvector of $B$ and $\lambda$ is the associated eigenvalue, then

\[ A^T A v = B v = \lambda v \]

and therefore

\[ \lambda = \frac{v^T A^T A v}{v^T v} = \frac{\|Av\|_2^2}{\|v\|_2^2} \ge 0. \]

Suppose that the vectors $w_i \in \mathbb{R}^n_*$, $i = 1, 2, \ldots, n$, are eigenvectors of $B$
corresponding to the eigenvalues $\lambda_i$, $i = 1, 2, \ldots, n$. Since $B$ is symmetric
we may assume that the vectors $w_i$ are orthogonal, i.e., $w_i^T w_j = 0$ for
$i \ne j$, and we can normalise them so that $w_i^T w_i = 1$ for $i = 1, 2, \ldots, n$.
Now choose an arbitrary vector $u$ in $\mathbb{R}^n_*$ and express it as a linear
combination of the vectors $w_i$, $i = 1, 2, \ldots, n$:

\[ u = c_1 w_1 + \cdots + c_n w_n. \]

Then,

\[ Bu = c_1 \lambda_1 w_1 + \cdots + c_n \lambda_n w_n. \]

We may assume, without loss of generality, that

\[ (0 \le)\ \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n. \]

Using the orthonormality of the vectors $w_i$, $i = 1, 2, \ldots, n$, we get that

\[ \|Au\|_2^2 = u^T A^T A u = u^T B u = c_1^2 \lambda_1 + \cdots + c_n^2 \lambda_n \le (c_1^2 + \cdots + c_n^2)\lambda_n = \lambda_n \|u\|_2^2 \tag{2.41} \]

for any vector $u \in \mathbb{R}^n_*$. Hence $\|A\|_2^2 \le \lambda_n$. To prove equality we simply
choose $u = w_n$ in (2.41), so that $c_1 = \cdots = c_{n-1} = 0$ and $c_n = 1$.


The square roots of the (nonnegative) eigenvalues of $A^T A$ are referred
to as the singular values of $A$. Thus we have shown that the 2-norm
of a matrix $A$ is equal to the largest singular value of $A$.

If the matrix $A$ is symmetric, then $B = A^T A = A^2$, and the eigenvalues
of $B$ are just the squares of the eigenvalues of $A$. In this special case
the 2-norm of $A$ is the largest of the absolute values of its eigenvalues.
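Theorem 2.9 reduces the computation of $\|A\|_2$ to finding the largest eigenvalue of $B = A^T A$. One simple way to estimate that eigenvalue is power iteration, a technique the text has not introduced at this point; the sketch below is an illustrative assumption rather than the book's method, and it can converge slowly (or stall) if the starting vector has almost no component along the dominant eigenvector.

```python
import math

def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def transpose(M):
    return [list(column) for column in zip(*M)]

def norm2_estimate(A, iterations=200):
    """Estimate ||A||_2 as the square root of the largest eigenvalue of A^T A."""
    At = transpose(A)
    n = len(A[0])
    u = [1.0 + 0.1 * j for j in range(n)]   # arbitrary nonzero starting vector
    lam = 0.0
    for _ in range(iterations):
        w = mat_vec(At, mat_vec(A, u))       # w = B u with B = A^T A
        size = math.sqrt(sum(t * t for t in w))
        if size == 0.0:                      # B u = 0, no information gained
            return 0.0
        # Rayleigh quotient u^T B u / u^T u approximates the largest eigenvalue
        lam = sum(s * t for s, t in zip(u, w)) / sum(t * t for t in u)
        u = [t / size for t in w]
    return math.sqrt(max(lam, 0.0))

print(norm2_estimate([[2.0, 0.0], [0.0, 1.0]]))  # eigenvalues of B are 4 and 1
print(norm2_estimate([[1.0, 2.0], [3.0, 4.0]]))
```

For the second matrix, $B$ has eigenvalues $15 \pm \sqrt{221}$, so the estimate should be close to $\sqrt{15 + \sqrt{221}}$.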


Theorem 2.10 Given that $\|\cdot\|$ is a subordinate matrix norm on $\mathbb{R}^{n \times n}$,

\[ \|AB\| \le \|A\|\,\|B\| \]

for any two matrices $A$ and $B$ in $\mathbb{R}^{n \times n}$.

Proof From the definition of subordinate matrix norm,

\[ \|AB\| = \max_{v \in \mathbb{R}^n_*} \frac{\|ABv\|}{\|v\|}. \]

As

\[ \|ABv\| \le \|A\|\,\|Bv\| \]

for all $v \in \mathbb{R}^n_*$, we have

\[ \|AB\| \le \max_{v \in \mathbb{R}^n_*} \frac{\|A\|\,\|Bv\|}{\|v\|} = \|A\| \max_{v \in \mathbb{R}^n_*} \frac{\|Bv\|}{\|v\|} = \|A\|\,\|B\|, \]

and hence the desired result.

Fig. 2.3. ‘Input’ $x \in D \subset V$ and ‘output’ $f(x) \in W$ for a mapping $f\colon V \to W$.


Now we are ready to embark on the study of sensitivity to pertur- 
bations in the problem of matrix inversion. In order to motivate the 
concept of condition number of a matrix which will play a key role in 
the analysis, we begin with a discussion of ‘conditioning’ in a slightly 
more general context. 

Consider a mapping $f$ from a subset $D$ of a normed linear space $V$
with norm $\|\cdot\|_V$ into another normed linear space $W$ with norm $\|\cdot\|_W$,
depicted in Figure 2.3, where $x \in D \subset V$ is regarded as the ‘input’ for $f$
and $f(x) \in W$ is the ‘output’. We shall be concerned with the sensitivity
of the output to perturbations in the input; therefore, as a measure of
sensitivity, we define the absolute condition number of $f$ by

\[ \mathrm{Cond}(f) = \sup_{x, y \in D,\ x \ne y} \frac{\|f(y) - f(x)\|_W}{\|y - x\|_V}. \tag{2.42} \]

If $\mathrm{Cond}(f) = +\infty$ or if $1 \ll \mathrm{Cond}(f) < +\infty$, we say that the mapping
$f$ is ill-conditioned.


Example 2.5 Consider the function $f\colon x \in D \mapsto \sqrt{x}$, where $D$ is a
closed subinterval of $[0, \infty)$. Clearly, if $D = [1, 2]$, then $\mathrm{Cond}(f) = 1/2$,
while if $D = [0, 1]$, then $\mathrm{Cond}(f) = +\infty$. Indeed, in the latter case,
perturbing $x = 0$ to $x = \varepsilon^2$, $0 < \varepsilon \ll 1$, leads to a perturbation of the
function value $f(0) = 0$ to $f(\varepsilon^2) = \varepsilon = \frac{1}{\varepsilon}\varepsilon^2$: a magnification by a factor
$\frac{1}{\varepsilon} \gg 1$ in comparison with the size of the perturbation in $x$.


When $\|f(y) - f(x)\|_W / \|y - x\|_V$ exhibits large variation as $(x, y)$ ranges
through $D \times D$, it is more helpful to consider a finer, local measure of
conditioning, the absolute local condition number, at $x \in D \subset V$,
of the function $f$, defined by

\[ \mathrm{Cond}_x(f) = \sup_{\delta x \in V \setminus \{0\},\ x + \delta x \in D} \frac{\|f(x + \delta x) - f(x)\|_W}{\|\delta x\|_V}. \tag{2.43} \]

Example 2.6 Let us consider the function $f\colon x \in D \mapsto \sqrt{x}$, defined
on the interval $D = (0, \infty)$. The absolute local condition number of $f$
at $x \in D$ is $\mathrm{Cond}_x(f) = 1/(2\sqrt{x})$. Clearly, $\lim_{x \to 0+} \mathrm{Cond}_x(f) = +\infty$,
$\lim_{x \to +\infty} \mathrm{Cond}_x(f) = 0$.


Although the definitions (2.42) and (2.43) seem intuitive, they are not
always satisfactory from the practical point of view since they depend
on the magnitudes of $f(x)$ and $x$. A more convenient definition of
conditioning is arrived at by rescaling (2.43) by the norms of $f(x)$ and $x$.
This leads us to the notion of relative local condition number

\[ \mathrm{cond}_x(f) = \sup_{\delta x \in V \setminus \{0\},\ x + \delta x \in D} \frac{\|f(x + \delta x) - f(x)\|_W / \|f(x)\|_W}{\|\delta x\|_V / \|x\|_V}, \]

where it is implicitly assumed that $x \in V \setminus \{0\}$ and $f(x) \in W \setminus \{0\}$.
The next example highlights the difference between the absolute local
condition number and the relative local condition number of $f$.

Example 2.7 Let us consider the function $f\colon x \in D \mapsto \sqrt{x}$, defined
on the interval $D = (0, \infty)$. Recall from the preceding example that the
absolute local condition number of $f$ at $x \in D$ approaches $+\infty$ as $x$ tends
to zero. In contrast with this, the relative local condition number of $f$ is
$\mathrm{cond}_x(f) = 1/2$ for all $x \in D$.

You may also wish to ponder the following, seemingly paradoxical,
observation: $\lim_{x \to 0} \mathrm{cond}_x(\sin) = 1$ and $\lim_{x \to \pi} \mathrm{cond}_x(\sin) = \infty$, even
though $\sin 0 = \sin \pi = 0$ and $\mathrm{Cond}_0(\sin) = \mathrm{Cond}_\pi(\sin) = 1$.
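The values quoted in Examples 2.6 and 2.7 can be checked numerically. The sketch below assumes that the perturbation $\delta x$ is small, so that for a differentiable real function the two local condition numbers reduce to $|f'(x)|$ and $|x f'(x)/f(x)|$; that reduction, the central-difference step `h`, and the function names are assumptions made for illustration.

```python
import math

def abs_local_cond(f, x, h=1e-6):
    """Approximate |f'(x)| by a central difference with step h."""
    return abs(f(x + h) - f(x - h)) / (2.0 * h)

def rel_local_cond(f, x, h=1e-6):
    """Approximate |x f'(x) / f(x)|, assuming x and f(x) are nonzero."""
    return abs_local_cond(f, x, h) * abs(x) / abs(f(x))

for x in (0.01, 1.0, 100.0):
    # the absolute number varies with x, the relative number stays near 1/2
    print(x, abs_local_cond(math.sqrt, x), rel_local_cond(math.sqrt, x))

# sin near pi: the absolute number is close to 1, the relative one is large
print(abs_local_cond(math.sin, math.pi), rel_local_cond(math.sin, math.pi - 1e-3))
```

The second print statement illustrates the observation about the sine function: near $\pi$ the output is close to zero, so a small absolute change in the output is a large relative one.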


Since the present section is concerned with the solution of the linear
system $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ is nonsingular and $b \in \mathbb{R}^n$, let us
consider the relative local condition number of the mapping

\[ A^{-1}\cdot\colon\ b \in \mathbb{R}^n \mapsto A^{-1} b \in \mathbb{R}^n \]

at $b \in \mathbb{R}^n_* = \mathbb{R}^n \setminus \{0\}$. We suppose that $\mathbb{R}^n$ has been equipped with a
vector norm $\|\cdot\|$ and, since there is no danger of confusion, we denote
the associated subordinate matrix norm by $\|\cdot\|$ also. Noting that $A^{-1}\cdot$
is defined on the whole of $\mathbb{R}^n$, it follows that $D = V = \mathbb{R}^n$, $W = \mathbb{R}^n$ and
we deduce that

\[ \mathrm{cond}_b(A^{-1}\cdot) = \sup_{\delta b \in \mathbb{R}^n_*} \frac{\|A^{-1}(b + \delta b) - A^{-1} b\| / \|A^{-1} b\|}{\|\delta b\| / \|b\|} = \frac{\|b\|}{\|A^{-1} b\|} \sup_{\delta b \in \mathbb{R}^n_*} \frac{\|A^{-1} \delta b\|}{\|\delta b\|} = \frac{\|b\|}{\|A^{-1} b\|}\,\|A^{-1}\|. \]

Since $\|b\| = \|A(A^{-1} b)\| \le \|A\|\,\|A^{-1} b\|$, we conclude that

\[ \mathrm{cond}_b(A^{-1}\cdot) \le \|A^{-1}\|\,\|A\|. \tag{2.44} \]

If now, instead, we consider the mapping

\[ A\cdot\colon\ x \in \mathbb{R}^n \mapsto Ax \in \mathbb{R}^n, \]

an identical argument shows that, for $x \in \mathbb{R}^n_*$,

\[ \mathrm{cond}_x(A\cdot) \le \|A\|\,\|A^{-1}\|. \tag{2.45} \]

The inequalities (2.44) and (2.45) indicate that the number $\|A^{-1}\|\,\|A\| =
\|A\|\,\|A^{-1}\|$ plays a relevant role in the analysis of sensitivity to
perturbations in numerical linear algebra; therefore we adopt the following
definition.


Definition 2.12 The condition number of a nonsingular matrix $A$
is defined by

\[ \kappa(A) = \|A\|\,\|A^{-1}\|. \]

Clearly, $\kappa(A^{-1}) = \kappa(A)$. Further, since $AA^{-1} = I$, it follows from
Theorem 2.10 that $\kappa(A) \ge 1$ for every matrix $A$. If $\kappa(A) \gg 1$, the
matrix is said to be ill-conditioned. Evidently the condition number
of a matrix is unaffected by scaling all its elements by multiplying by a
nonzero constant.¹

1 We note in passing that, more generally, the condition number of a matrix
$A \in \mathbb{R}^{m \times n}$ is defined by $\kappa(A) = \|A\|\,\|A^+\|$ where $A^+$ is the Moore–Penrose generalised
inverse of $A$. In the special case when $m = n$ and $A$ is nonsingular, $A^+ = A^{-1}$.
For further details in this direction, we refer to the Notes at the end of the chapter.
Here, the norm $\|\cdot\|$ on $\mathbb{R}^{m \times n}$ is defined as in (2.38). Theorems 2.7 and 2.8 are

There is a condition number for each norm; for example, if we use the
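2-norm, then $\kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2$, and so on. Indeed, the size of the
condition number of a matrix $A \in \mathbb{R}^{n \times n}$ is strongly dependent on the
choice of the norm in $\mathbb{R}^n$. In order to illustrate the last point, let us
consider the unit lower triangular matrix $A \in \mathbb{R}^{n \times n}$ defined by

\[ A = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \tag{2.46} \]

and note that its inverse is

\[ A^{-1} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ -1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & 0 & \cdots & 1 \end{pmatrix}. \]

Since

\[ \|A\|_1 = n \quad \text{and} \quad \|A^{-1}\|_1 = n, \]

it follows that $\kappa_1(A) = n^2$. On the other hand,

\[ \|A\|_\infty = 2 \quad \text{and} \quad \|A^{-1}\|_\infty = 2, \]

so that $\kappa_\infty(A) = 4 \ll n^2 = \kappa_1(A)$ when $n \gg 1$. (A question for the
curious: how does the condition number $\kappa_2(A)$ of the matrix $A$ in (2.46)
depend on the size $n$ of $A$? See Exercise 11.)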
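The dependence of the condition number on the norm can be confirmed directly for the matrix in (2.46). In the sketch below the helper `first_column_matrix` and the choice $n = 6$ are illustrative assumptions; the norms are computed as the largest column-sum and row-sum from Theorems 2.8 and 2.7.

```python
def first_column_matrix(n, below):
    """n x n identity matrix whose first column is `below` under the diagonal."""
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        M[i][0] = below
    return M

def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*Y)]
            for row in X]

def matrix_norm_1(M):
    return max(sum(abs(t) for t in column) for column in zip(*M))

def matrix_norm_inf(M):
    return max(sum(abs(t) for t in row) for row in M)

n = 6
A = first_column_matrix(n, 1.0)        # the matrix of (2.46)
A_inv = first_column_matrix(n, -1.0)   # its claimed inverse
identity = first_column_matrix(n, 0.0)

print(mat_mul(A, A_inv) == identity)   # confirms that A_inv inverts A
kappa_1 = matrix_norm_1(A) * matrix_norm_1(A_inv)
kappa_inf = matrix_norm_inf(A) * matrix_norm_inf(A_inv)
print(kappa_1, kappa_inf)              # n squared against 4
```

Increasing `n` makes the gap between the two condition numbers as large as desired, while $\kappa_\infty(A)$ stays fixed at 4.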

It is left as an exercise to show that for a symmetric matrix $A$ (i.e.,
when $A^T = A$), the 2-norm condition number $\kappa_2(A)$ is the ratio of the
largest of the absolute values of the eigenvalues of $A$ to the smallest of
the absolute values of the eigenvalues (see Exercise 9).


easily extended to show that, for $A \in \mathbb{R}^{m \times n}$,

\[ \|A\|_\infty = \max_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}| \quad \text{and} \quad \|A\|_1 = \max_{j=1}^{n} \sum_{i=1}^{m} |a_{ij}|. \]

The 2-norm of $A$, $\|A\|_2$, is equal to the largest singular value of $A$, i.e., the square
root of the largest eigenvalue of the matrix $A^T A \in \mathbb{R}^{n \times n}$, just as in Theorem 2.9.

We can now assess the sensitivity of the solution of the system $Ax = b$
to changes in the right-hand side vector b. 


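Theorem 2.11 Suppose that $A \in \mathbb{R}^{n \times n}$ is a nonsingular matrix, $b \in \mathbb{R}^n_*$,
$Ax = b$ and $A(x + \delta x) = b + \delta b$, with $\delta x, \delta b \in \mathbb{R}^n$. Then, $x \in \mathbb{R}^n_*$ and

\[ \frac{\|\delta x\|}{\|x\|} \le \kappa(A)\,\frac{\|\delta b\|}{\|b\|}. \]

Proof Evidently,

\[ b = Ax \quad \text{and} \quad \delta x = A^{-1}(b + \delta b) - x = A^{-1}\delta b. \]

As $b \ne 0$ and $A$ is nonsingular, the first of these implies that $x \ne 0$.
Further,

\[ \|b\| \le \|A\|\,\|x\| \quad \text{and} \quad \|\delta x\| \le \|A^{-1}\|\,\|\delta b\|. \]

The result follows immediately by multiplying these inequalities and
dividing by $\|b\|\,\|x\|$.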
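A small numerical experiment shows how the bound of Theorem 2.11 behaves for a nearly singular matrix. The $2 \times 2$ matrix, the right-hand side and the perturbation below are hypothetical values chosen for illustration, the $\infty$-norm is used throughout, and the explicit $2 \times 2$ inverse formula is used only to keep the sketch short.

```python
def mat_vec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def vec_norm_inf(v):
    return max(abs(t) for t in v)

def mat_norm_inf(M):
    return max(sum(abs(t) for t in row) for row in M)

A = [[1.0, 1.0], [1.0, 1.0001]]   # nearly singular, hence ill-conditioned
A_inv = inverse_2x2(A)
b = [2.0, 2.0001]                 # chosen so that x is close to (1, 1)
delta_b = [0.0, 0.0001]           # small change in the right-hand side

x = mat_vec(A_inv, b)
delta_x = mat_vec(A_inv, delta_b)  # delta_x = A^{-1} delta_b, as in the proof

kappa = mat_norm_inf(A) * mat_norm_inf(A_inv)
lhs = vec_norm_inf(delta_x) / vec_norm_inf(x)
rhs = kappa * vec_norm_inf(delta_b) / vec_norm_inf(b)
print(kappa, lhs, rhs)
```

Here a relative change of about $5 \times 10^{-5}$ in $b$ produces a relative change of about 1 in $x$, which is within a factor of two of the bound.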


Owing to the effect of rounding errors during the calculation, the
numerical solution of $Ax = b$ will not be exact. The numerical solution
may be written $x + \delta x$, and we shall usually find that this vector satisfies
the equation $A(x + \delta x) = b + \delta b$, where the elements of $\delta b$ are very small.
If the matrix $A$ has a large condition number, however, the elements of
$\delta x$ may not be so small. An example of this will be presented in the
next section.


2.8 Hilbert matrix 
We consider the Hilbert matrix¹ $H_n$ of order $n$, whose elements are

\[ h_{ij} = \frac{1}{i + j - 1}, \qquad i, j = 1, 2, \ldots, n. \]

This matrix is symmetric and positive definite (i.e., $H_n^T = H_n$, and
$x^T H_n x > 0$ for all $x \in \mathbb{R}^n_*$), and therefore all of its eigenvalues are real
and positive (cf. Theorem 3.1, part (ii)). However, $H_n$ becomes very
nearly singular as $n$ increases. Table 2.1 shows the largest and smallest
eigenvalues, and the 2-norm condition number $\kappa_2(H_n)$ of $H_n$, for various
values of $n$.
1 David Hilbert (23 January 1862, Königsberg, Prussia (now Kaliningrad, Russia) –
14 February 1943, Göttingen, Germany) was the most prominent member of the
Göttingen school of mathematics. He made significant contributions to many areas
of the subject, including algebra, geometry, number theory, calculus of variations,
functional analysis, integral equations, and the foundations of mathematics.

Table 2.1. Eigenvalues and condition number of the Hilbert matrix $H_n$.

  n     $\lambda_{\max}$    $\lambda_{\min}$             $\kappa_2(H_n)$

  5     1.6      $3.3 \times 10^{-6}$     $4.8 \times 10^{5}$
  10    1.8      $1.1 \times 10^{-13}$    $1.6 \times 10^{13}$
  15    1.8      $3.0 \times 10^{-21}$    $6.1 \times 10^{20}$
  20    1.9      $7.8 \times 10^{-29}$    $2.5 \times 10^{28}$
  25    2.0      $1.9 \times 10^{-36}$    $1.0 \times 10^{36}$


Fig. 2.4. Condition number $\kappa_2(H_n)$ of the Hilbert matrix $H_n$ of size $n =
2, 3, \ldots, 12$ in the 2-norm, against $n$, in a semilogarithmic-scale plot.

Figure 2.4 depicts the logarithm of the condition number $\kappa_2(H_n)$ in
the 2-norm of the Hilbert matrix $H_n$ against its order, $n$; the straight line
in our semilogarithmic-scale plot indicates that $\kappa_2(H_n)$, as a function of
$n$, exhibits exponential growth. Indeed, it can be shown that

\[ \kappa_2(H_n) \sim \frac{(\sqrt{2} + 1)^{4n+4}}{2^{15/4}\sqrt{\pi n}} \qquad \text{as } n \to \infty. \]

We now define the vector $b$ with elements $b_i = \sum_{j=1}^{n} j/(i + j - 1)$,
$i = 1, 2, \ldots, n$, chosen so that the solution of $Ax = b$, with $A = H_n$,
is the vector $x$ with elements $x_i = i$, $i = 1, 2, \ldots, n$. We obtain a
numerical solution of the system, using the method described in Section
2.5 to give the calculated vector $x + \delta x$, and then compute the residual
$\delta b$ from $A(x + \delta x) = b + \delta b$. The calculation uses arithmetic operations
correct to 15 decimal digits, which is roughly the accuracy used by many
computer systems. The results are listed in Table 2.2.


Table 2.2. Rounding errors in the solution of $H_n x = b$, where $H_n$ is
the Hilbert matrix of order $n$ and $x = (1, 2, \ldots, n)^T$.


  n     $\|\delta b\|_2/\|b\|_2$     $\|\delta x\|_2/\|x\|_2$


5 1.2x107 85x 107U 
10 1.7x107! 1.3 x 1073 
15 2.8x 107! 4.1 
20 63x 107% 8.7 
25 1.91078 5.5 x 10? 


The relative size of the residual is, in nearly every case, about the size 
of the basic rounding error, 10~!°. The resulting errors in x are smaller 
than the bound given by Theorem 2.11, as might be expected, since 
that bound corresponds to the worst possible case. In any case, for the 
Hilbert matrix of order greater than 14 the error is larger than the calcu- 
lated solution itself, which renders the calculated solution meaningless. 
For matrices of this kind the condition number and the bound given 
by Theorem 2.11 are so large that they have little practical relevance, 
though they do indicate that, due to sensitivity to rounding errors, the 
numerical calculations are of unreliable accuracy. 
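The experiment described above can be reproduced in outline as follows. This is a sketch under stated assumptions: Gaussian elimination with partial pivoting stands in for the method of Section 2.5, the arithmetic is IEEE double precision, and the entries of $H_n$ are themselves rounded when stored, so the figures obtained need not match Table 2.2 digit for digit. All function names are illustrative.

```python
from fractions import Fraction
import math

def hilbert(n):
    """Hilbert matrix with h_ij = 1/(i + j - 1) for 1-based i, j."""
    return [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]

def right_hand_side(n):
    """b_i = sum_j j/(i + j - 1), computed exactly and then rounded."""
    return [float(sum(Fraction(j, i + j - 1) for j in range(1, n + 1)))
            for i in range(1, n + 1)]

def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def norm_2(v):
    return math.sqrt(sum(t * t for t in v))

def relative_error(n):
    """||delta x||_2 / ||x||_2 for the exact solution x = (1, 2, ..., n)."""
    computed = solve(hilbert(n), right_hand_side(n))
    exact = [float(i) for i in range(1, n + 1)]
    return norm_2([c - e for c, e in zip(computed, exact)]) / norm_2(exact)

for n in (5, 10):
    print(n, relative_error(n))
```

The error for $n = 10$ should be several orders of magnitude larger than for $n = 5$, in line with the growth of $\kappa_2(H_n)$.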

The Hilbert matrix is, of course, a rather extreme example of an ill- 
conditioned matrix. However, we shall meet it in an important problem 
in Section 9.3 concerning the least squares approximation of a function 
by polynomials, where we shall see how a reformulation of the problem 
using an orthonormal basis avoids the disastrous loss of accuracy that 
would otherwise occur. In the next section, we introduce the idea of 
least squares approximation in the context of linear algebra and consider 
the solution of the resulting system of linear equations using the QR 
algorithm; this, too, relies on the notion of (ortho)normalisation. 


2.9 Least squares method 


Up to now, we have been dealing with systems of linear equations of
the form $Ax = b$ where $A \in \mathbb{R}^{n \times n}$. However, it is frequently the case
in practical problems (typically, in problems of data-fitting) that the
matrix $A$ is not square but rectangular, and we have to solve a linear
system of equations $Ax = b$ with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, with $m > n$; since
there are more equations than unknowns, in general such a system will
have no solution. Consider, for example, the linear system (with $m = 3$,
$n = 2$)


by adding the first two of the three equations and comparing the result
with the third, it is easily seen that there is no solution. If, on the
other hand, $m < n$, then the situation is reversed and there may be an
infinite number of solutions. Consider, for example, the linear system
(with $m = 1$, $n = 2$)


any vector $x = (\mu, 1 - 3\mu)^T$, with $\mu \in \mathbb{R}$, is a solution to this system.

Suppose that $m \ge n$; we may then need to find a vector $x \in \mathbb{R}^n$
which satisfies $Ax - b = 0$ in $\mathbb{R}^m$ as nearly as possible in some sense.
This suggests that we define the residual vector $r = Ax - b$ and seek
to minimise a certain norm of $r$ in $\mathbb{R}^m$. From the practical point of
view, it is particularly convenient to minimise the residual vector $r$ in
the 2-norm on $\mathbb{R}^m$; this leads to the least squares problem:

\[ \text{Minimise } \|Ax - b\|_2 \text{ over } x \in \mathbb{R}^n. \]
This is clearly equivalent to minimising the square of the norm; so, on
noting that

\[ \|Ax - b\|_2^2 = (Ax - b)^T (Ax - b), \]

the problem may be restated as

\[ \text{Minimise } (Ax - b)^T (Ax - b) \text{ over } x \in \mathbb{R}^n. \]

Since

\[ (Ax - b)^T (Ax - b) = x^T A^T A x - 2 x^T A^T b + b^T b, \]

the quantity to be minimised is a nonnegative quadratic function of the
$n$ components of the vector $x$; the minimum therefore exists, and may
be found by equating to zero the partial derivatives with respect to the
components. This leads to the system of equations

\[ Bx = A^T b, \qquad \text{where } B = A^T A. \]

The matrix $B$ is symmetric, and if $A$ has full rank, $n$, then $B$ is
nonsingular; it is called the normal matrix, and the system $Bx = A^T b$ is
called the system of normal equations.

The normal equations have important theoretical properties, but do 
not lead to a satisfactory numerical algorithm, except for fairly small 
problems. The difficulty is that in a practical least squares problem the 
matrix A is likely to be quite ill-conditioned, and B = ATA will then be 
extremely ill-conditioned. For example, if 


G2) 


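where $\varepsilon \in (0, 1)$, then $\kappa_2(A) = \varepsilon^{-1} > 1$, while

\[ \kappa_2(B) = \kappa_2(A^T A) = \varepsilon^{-2} = \varepsilon^{-1} \kappa_2(A) \gg \kappa_2(A) \]

when $0 < \varepsilon \ll 1$. If possible, one should avoid using a method which
leads to such a dramatic deterioration of the condition number.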

There are various alternative techniques which avoid the direct con- 
struction of the normal matrix ATA, and so do not lead to this extreme 
ill-conditioning. Here we shall describe just one algorithm, which begins 
by factorising the matrix A, but using an orthogonal matrix rather than 
the lower triangular factor as in Section 2.3. 


Theorem 2.12 Suppose that $A \in \mathbb{R}^{m \times n}$ where $m \ge n$. Then, $A$ can be
written in the form

\[ A = QR, \]

where $R$ is an upper triangular $n \times n$ matrix, and $Q$ is an $m \times n$ matrix
which satisfies

\[ Q^T Q = I_n, \tag{2.47} \]

where $I_n$ is the $n \times n$ identity matrix; see Figure 2.5. If $\mathrm{rank}(A) = n$,
then $R$ is nonsingular.


Proof We use induction on $n$, the number of columns in $A$. The theorem
clearly holds when $n = 1$ so that $A$ has only one column. Indeed, writing
$c$ for this column vector and assuming that $c \ne 0$, the matrix $Q$ has just
one column, the vector $c/\|c\|_2$, and $R$ has a single element, $\|c\|_2$. In
the special case where $c$ is the zero vector we can choose $R$ to have
the single element 0, and $Q$ to have a single column which can be an
arbitrary vector in $\mathbb{R}^m$ whose 2-norm is equal to 1.

Fig. 2.5. QR factorisation of $A \in \mathbb{R}^{m \times n}$, $m \ge n$: $A = QR$, $Q \in \mathbb{R}^{m \times n}$,
$Q^T Q = I_n$, and the matrix $R \in \mathbb{R}^{n \times n}$ is upper triangular.

Suppose that the theorem is true when $n = k$, where $1 \le k < m$.
Consider a matrix $A$ which has $m$ rows and $k + 1$ columns, partitioned
as

\[ A = (A_k \ \ a), \]

where $a \in \mathbb{R}^m$ is a column vector and $A_k$ has $k$ columns. To obtain the
desired factorisation $QR$ of $A$ we seek $Q = (Q_k \ \ q)$ and

\[ R = \begin{pmatrix} R_k & r \\ 0 & \alpha \end{pmatrix} \]

such that

\[ A = (A_k \ \ a) = (Q_k \ \ q) \begin{pmatrix} R_k & r \\ 0 & \alpha \end{pmatrix}. \]

Multiplying this out and requiring that $Q^T Q = I_{k+1}$, the identity matrix
of order $k + 1$, we conclude that

\[ A_k = Q_k R_k, \tag{2.48} \]
\[ a = Q_k r + q\alpha, \tag{2.49} \]
\[ Q_k^T Q_k = I_k, \tag{2.50} \]
\[ q^T Q_k = 0^T, \tag{2.51} \]
\[ q^T q = 1. \tag{2.52} \]

These equations show that $Q_k R_k$ is the factorisation of $A_k$, which exists
by the inductive hypothesis, and then lead to

\[ r = Q_k^T a, \qquad q = (1/\alpha)(a - Q_k Q_k^T a), \]

where $\alpha = \|a - Q_k Q_k^T a\|_2$. The number $\alpha$ is the constant required to
ensure that the vector $q$ is normalised.

The construction fails when $a - Q_k Q_k^T a = 0$, for then the vector $q$
cannot be normalised. In this case we choose $q$ to be any normalised
vector in $\mathbb{R}^m$ which is orthogonal in $\mathbb{R}^m$ to all the columns of $Q_k$, for
then $q^T Q_k = 0^T$ as required. The condition at the beginning of the
proof, that $k < m$, is required by the fact that when $k = m$ the matrix
$Q_m$ is a square orthogonal matrix, and there is no vector $q$ in $\mathbb{R}^m \setminus \{0\}$
such that $q^T Q_m = 0^T$.

With these definitions of $q$, $r$, $\alpha$, $Q_k$ and $R_k$, we have constructed the
required factors of $A$, showing that the theorem is true when $n = k + 1$.
Since it holds when $n = 1$ the induction is complete.

Now, for the final part, suppose that $\mathrm{rank}(A) = n$. If $R$ were singular,
there would exist a nonzero vector $p \in \mathbb{R}^n$ such that $Rp = 0$; then,
$Ap = QRp = 0$, and hence $\mathrm{rank}(A) < n$, contradicting our hypothesis
that $\mathrm{rank}(A) = n$. Therefore, if $\mathrm{rank}(A) = n$, then $R$ is nonsingular.


The matrix factorisation whose existence is asserted in Theorem 2.12 
is called the QR factorisation. Here, we shall present its use in the 
solution of least squares problems. In Chapter 5 we shall revisit the idea 
in a different context which concerns the numerical solution of eigenvalue 
problems. 


Theorem 2.13 Suppose that $A \in \mathbb{R}^{m \times n}$, with $m \ge n$ and $\mathrm{rank}(A) = n$,
and let $b \in \mathbb{R}^m$. Then, there exists a unique least squares solution of
the system of equations $Ax = b$: a vector $x$ in $\mathbb{R}^n$ which minimises the
function $y \mapsto \|Ay - b\|_2$ over all $y$ in $\mathbb{R}^n$. The vector $x$ can be obtained
by finding the factors $Q$ and $R$ of $A$ defined in Theorem 2.12, and then
solving the nonsingular upper triangular system $Rx = Q^T b$.


Proof The matrix $Q$ has $m$ rows and $n$ columns, with $m \ge n$, and it
satisfies

\[ Q^T Q = I_n. \]

We shall suppose that $m > n$, the case $m = n$ being a trivial special
case with

\[ x = A^{-1} b = (QR)^{-1} b = R^{-1} Q^{-1} b = R^{-1} Q^T b, \]

and hence $Rx = Q^T b$, as required.
For $m > n$ now, the vector $b \in \mathbb{R}^m$ can be written as the sum of two
vectors:

\[ b = b_q + b_r, \]

where $b_q$ is in the linear space spanned by the $n$ columns of the matrix $Q$,
and $b_r$ is in the orthogonal complement of this space in $\mathbb{R}^m$. The vector
$b_q$ is a linear combination of the columns of $Q$, and $b_r$ is orthogonal to
every column of $Q$: i.e., there exists $c \in \mathbb{R}^n$ such that

\[ b = b_q + b_r, \qquad b_q = Qc, \qquad Q^T b_r = 0. \tag{2.53} \]

Now, suppose that $x$ is the solution of $Rx = Q^T b$, and that $y$ is any
vector in $\mathbb{R}^n$. Then,

\[ \begin{aligned}
Ay - b &= QRy - b \\
 &= QR(y - x) + QRx - b \\
 &= QR(y - x) + QQ^T b - b \\
 &= QR(y - x) + QQ^T b_q + QQ^T b_r - b_q - b_r \\
 &= QR(y - x) + QQ^T Qc - b_q - b_r \\
 &= QR(y - x) - b_r,
\end{aligned} \]

where we have used (2.53) repeatedly; in particular, the last equality
follows by noting that $Q^T Q = I_n$. Hence

\[ \begin{aligned}
\|Ay - b\|_2^2 &= (y - x)^T R^T Q^T Q R (y - x) + b_r^T b_r - 2 (y - x)^T R^T Q^T b_r \\
 &= \|R(y - x)\|_2^2 + \|b_r\|_2^2 \\
 &\ge \|b_r\|_2^2,
\end{aligned} \]

since $Q^T b_r = 0$. Thus $\|Ay - b\|_2$ is smallest when $R(y - x) = 0$, which
implies that $y = x$, since the matrix $R$ is nonsingular. Hence $x$, defined
as the solution of $Rx = Q^T b$, is the required least squares solution.
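The construction in the proof of Theorem 2.12, together with the solution step of Theorem 2.13, can be sketched as follows. The code builds $Q$ and $R$ column by column using $r = Q_k^T a$ and $q = (a - Q_k r)/\alpha$, which is the classical Gram–Schmidt process, and then solves $Rx = Q^T b$ by back substitution. It assumes that $A$ has full column rank; the function names and the data-fitting example are illustrative assumptions, and in floating-point arithmetic this orthogonalisation can lose accuracy for ill-conditioned $A$.

```python
import math

def qr_gram_schmidt(A):
    """Reduced QR factorisation of an m x n matrix given as a list of rows.

    Returns the columns of Q and the upper triangular matrix R.
    """
    m, n = len(A), len(A[0])
    columns = [[A[i][j] for i in range(m)] for j in range(n)]
    Q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for k, a in enumerate(columns):
        v = a[:]
        for i, q in enumerate(Q_cols):
            R[i][k] = sum(qi * ai for qi, ai in zip(q, a))   # entry of r = Q_k^T a
            v = [vi - R[i][k] * qi for vi, qi in zip(v, q)]  # v = a - Q_k r
        alpha = math.sqrt(sum(vi * vi for vi in v))
        if alpha == 0.0:
            raise ValueError("columns of A are linearly dependent")
        R[k][k] = alpha
        Q_cols.append([vi / alpha for vi in v])
    return Q_cols, R

def least_squares(A, b):
    """Minimise ||Ax - b||_2 by solving R x = Q^T b with back substitution."""
    Q_cols, R = qr_gram_schmidt(A)
    n = len(R)
    rhs = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q_cols]  # Q^T b
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(R[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (rhs[k] - s) / R[k][k]
    return x

# Fit a straight line c0 + c1 * t to the points (0, 1), (1, 2), (2, 2).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
print(least_squares(A, b))
```

For this data the normal equations $3c_0 + 3c_1 = 5$, $3c_0 + 5c_1 = 6$ give $c_0 = 7/6$ and $c_1 = 1/2$, which the QR route should reproduce up to rounding.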


2.10 Notes 


There are many good books on the subject of numerical linear algebra 
which cover the topics discussed in this chapter in much greater detail,
and address questions which we have not touched on here. Without any
attempt to be exhaustive, we single out four texts from the vast litera-
ture. The first two books on the list below are well-known monographs
on the subject, while the last two are excellent textbooks.


» G.H. GOLUB AND C.F. VAN LOAN, Matrix Computations, Third 
Edition, Johns Hopkins University Press, Baltimore, 1996. 

» N.J. HIGHAM, Accuracy and Stability of Numerical Algorithms, SIAM, 
Philadelphia, 1996. 

» L.N. TREFETHEN AND D. Bau, III, Numerical Linear Algebra, SIAM, 
Philadelphia, 1997. 

» P.G. CIARLET, Introduction to Numerical Linear Algebra and Opti- 
misation, Cambridge University Press, Cambridge, 1989. 


As we have already noted in Section 2.2, the invention of the
elimination technique is attributed to Gauss who published the method in
his Theoria motus (1809), although the idea was already known to the
Chinese two thousand years ago. Gauss himself was concerned with
positive definite systems. The method was extended to linear systems with
general matrices by Jacobi.¹ The interpretation of Gaussian elimination
as matrix factorisation is due to P.S. Dwyer: A matrix presentation of
least squares and correlation theory with matrix justification of improved
methods of solutions, Ann. Math. Stat. 15, 82–89, 1944.

The sensitivity of Gaussian elimination to rounding errors was studied
by Wilkinson² in Error analysis of direct methods of matrix inversion, J.
Assoc. Comput. Mach. 8, 281–330, 1961. The idea of pivoting was used
as early as 1947 by von Neumann³ and Goldstein.⁴ The concept of the
condition number of a matrix was introduced by Turing⁵ in Rounding-off
errors in matrix processes, Quart. J. Mech. Appl. Math. 1, 287–308,
1948. Our treatment of condition numbers follows the textbook of
Trefethen and Bau, cited above.

1 Carl Gustav Jacob Jacobi (10 December 1804, Potsdam, Prussia, Holy Roman
Empire (now Germany) – 18 February 1851, Berlin, Germany) made important
contributions to the theory of elliptic functions and differential equations. The
English translation, by G.W. Stewart, of Jacobi’s German original article is available
from the Internet on ftp://thales.cs.umd.edu/pub/biographical/xhist.html

2 James Hardy Wilkinson (27 September 1919, Strood, Kent, England – 5 October
1986, London, England).

3 John von Neumann (28 December 1903, Budapest, Austria-Hungary (now in
Hungary) – 8 February 1957, Washington DC, USA).

4 Sydney Goldstein (3 December 1903, Hull, England — 22 January 1989, Belmont, 
Massachusetts, USA). 


5 Alan Mathison Turing (23 June 1912, London, England – 7 June 1954, Wilmslow,
Cheshire, England).

Normed linear spaces play a key role in functional analysis (see, for ex-
ample, K. Yosida, Functional Analysis, Third Edition, Springer, Berlin,
1971, page 30). Here, we have concentrated on finite-dimensional normed
linear spaces over the field of real numbers.

The relevance of norms in numerical linear algebra was highlighted by
Householder¹ in his book The Theory of Matrices in Numerical Analysis,
Blaisdell, New York, 1964.

The idea of least squares fitting is due to Gauss, who invented the
method in the 1790s. However, it was the French mathematician
Legendre² who first published the method in 1806 in a book on
determining the orbits of comets. Legendre’s method involved a number of
observations taken at equal intervals and he assumed that the comet
followed a parabolic path, so he ended up with more equations than
there were unknowns. Legendre then applied his methods to the data
known for two comets. In an Appendix to the book Legendre described
the least squares method of fitting a curve to the data available. Gauss
published his version of the least squares method in 1809 and, although
acknowledging that it had already appeared in Legendre’s book, Gauss
nevertheless claimed priority for himself. This greatly hurt Legendre,
leading to one of the infamous priority disputes in the history of
mathematics. A recent exhaustive monograph on numerical algorithms for
least squares problems is due to Å. Björck: Numerical Methods for Least
Squares Problems, SIAM, Philadelphia, 1996.

The version of the QR factorisation considered here is the reduced
version, following the terminology in Chapter 7 of Trefethen and Bau.
In the full version of the QR factorisation for a matrix $A \in \mathbb{R}^{m \times n}$, we
have $A = QR$, where $Q \in \mathbb{R}^{m \times m}$, $R \in \mathbb{R}^{m \times n}$ (cf. Chapter 5).

In a footnote to Definition 2.12 we mentioned the Moore–Penrose
generalised inverse $A^+$ of a matrix $A \in \mathbb{R}^{m \times n}$. $A^+$ can be defined
through the singular value decomposition of $A$ (cf. L.N. Trefethen and D.
Bau, III: Numerical Linear Algebra, SIAM, Philadelphia, 1997). Recall
that the singular values of $A$ are the square roots of the (nonnegative)
eigenvalues of the matrix $A^T A$.

1 Alton Scott Householder (5 May 1904, Rockford, Illinois, USA – 4 July 1993,
Malibu, California, USA) was one of the pioneers of numerical linear algebra.
Householder’s obituary by G.W. Stewart, published in SIAM News, is available from
http://www.inf.ethz.ch/research/wr/conferences/householder/stewart.html


2 Adrien-Marie Legendre (18 September 1752, Paris, France – 10 January 1833,
Paris, France).

Theorem 2.14 (Singular value decomposition) Let $A \in \mathbb{R}^{m \times n}$;
then, there exist $U \in \mathbb{R}^{m \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{n \times n}$ such that

\[ A = U \Sigma V^T, \]

where $\Sigma$ is a diagonal matrix whose diagonal entries, $\sigma_{ii}$, $i = 1, 2, \ldots, n$,
are the singular values of $A$, $U^T U = I_n$ and $V^T V = I_n$, with $I_n$ denoting
the $n \times n$ identity matrix.


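The Moore–Penrose generalised inverse of the diagonal matrix $\Sigma \in \mathbb{R}^{n\times n}$ 
is defined as the diagonal matrix $\Sigma^{+} \in \mathbb{R}^{n\times n}$ whose diagonal 
entries are 

\[ (\Sigma^{+})_{ii} = \begin{cases} \sigma_{ii}^{-1} & \text{if } \sigma_{ii} \neq 0, \\ 0 & \text{if } \sigma_{ii} = 0. \end{cases} \]

The generalised inverse $A^{+} \in \mathbb{R}^{n\times m}$ of a matrix $A \in \mathbb{R}^{m\times n}$ with singular 
value decomposition $A = U \Sigma V^{T}$ is defined by 

\[ A^{+} = V \Sigma^{+} U^{T} . \]

In the special case when $m = n$ and $A \in \mathbb{R}^{n\times n}$ is nonsingular, the $n$ 
singular values of $A$ are all nonzero and therefore $\Sigma^{+} = \Sigma^{-1}$. Hence, 
also, $A^{+} = A^{-1}$, which then justifies the use of the terminology 
‘generalised inverse’ for the matrix $A^{+}$ defined above. 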
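
As a small illustration of the definition (the matrix below is chosen here purely for convenience and does not come from the text), take $m = 2$, $n = 1$ and the single-column matrix $A = (3, 4)^{T}$. Its only singular value is $\sqrt{A^{T}A} = 5$, and one admissible decomposition is

```latex
\[
A = \begin{pmatrix} 3 \\ 4 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 3/5 \\ 4/5 \end{pmatrix}}_{U}\,
    \underbrace{(5)}_{\Sigma}\,
    \underbrace{(1)}_{V^{T}},
\qquad
A^{+} = V \Sigma^{+} U^{T}
      = (1)\left(\tfrac{1}{5}\right)\begin{pmatrix} 3/5 & 4/5 \end{pmatrix}
      = \begin{pmatrix} 3/25 & 4/25 \end{pmatrix},
\]
\[
A^{+} A = \tfrac{9}{25} + \tfrac{16}{25} = 1 = I_1 .
\]
```

For a singular diagonal matrix the zero entries are simply left at zero; for instance, $\Sigma = \mathrm{diag}(2, 0)$ gives $\Sigma^{+} = \mathrm{diag}(1/2, 0)$.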


Exercises 


2.1 Let $n \ge 2$. Given the matrix $A = (a_{ij}) \in \mathbb{R}^{n\times n}$, the permutation 
matrix $Q \in \mathbb{R}^{n\times n}$ reverses the order of the rows of $A$, so that 
$(QA)_{ij} = a_{n+1-i,j}$. If $L \in \mathbb{R}^{n\times n}$ is a lower triangular matrix, 
what is the structure of the matrix $QLQ$? 
Show how to factorise $A \in \mathbb{R}^{n\times n}$ in the form $A = UL$, where 
$U \in \mathbb{R}^{n\times n}$ is unit upper triangular and $L \in \mathbb{R}^{n\times n}$ is lower 
triangular. What conditions on $A$ will ensure that the factorisation 
exists? Give an example of a square matrix $A$ which cannot be 
factorised in this way. 
2.2 Let $n \ge 2$. Consider a matrix $A \in \mathbb{R}^{n\times n}$ whose every leading 
principal submatrix of order less than $n$ is nonsingular. Show 
that $A$ can be factored in the form $A = LDU$, where $L \in \mathbb{R}^{n\times n}$ 
is unit lower triangular, $D \in \mathbb{R}^{n\times n}$ is diagonal and $U \in \mathbb{R}^{n\times n}$ is 
unit upper triangular. 
If the factorisation $A = LU$ is known, where $L$ is unit lower 
triangular and $U$ is upper triangular, show how to find the 
factors of the transpose $A^{T}$. 

2.3 Let $n \ge 2$ and suppose that the matrix $A \in \mathbb{R}^{n\times n}$ is nonsingular. 
Show by induction, as in Theorem 2.3, that there are a permutation 
matrix $P \in \mathbb{R}^{n\times n}$, a lower triangular matrix $L \in \mathbb{R}^{n\times n}$, and 
a unit upper triangular matrix $U \in \mathbb{R}^{n\times n}$ such that $PA = LU$. 
By finding a suitable $2 \times 2$ matrix $A$, or otherwise, show that 
this may not be true if $A$ is singular. 

2.4 The lower triangular matrix $L \in \mathbb{R}^{n\times n}$, $n \ge 2$, is nonsingular, 
and the vector $b \in \mathbb{R}^{n}$ is such that $b_i = 0$, $i = 1, 2, \dots, k$, with 
$1 \le k \le n$. The vector $y \in \mathbb{R}^{n}$ is the solution of $Ly = b$. 
Show, by partitioning $L$, that $y_j = 0$, $j = 1, 2, \dots, k$. Hence 
give an alternative proof of Theorem 2.1(iv), that the inverse of 
a nonsingular lower triangular matrix is itself lower triangular. 

2.5 Given a matrix $A \in \mathbb{R}^{n\times n}$, define the matrix $B \in \mathbb{R}^{n\times 2n}$ in 
which the first $n$ columns are the columns of $A$, and the last $n$ 
columns are the columns of the identity matrix $I_n$. Consider the 
following computational scheme. Treat the rows of the matrix 
$B$ in order, so that $j = 1, 2, \dots, n$. Multiply every element in 
row $j$ by the reciprocal of the diagonal element, $1/b_{jj}$; then, 
replace every element $b_{ik}$ which is not in row $j$, so that $i \neq j$, 
by $b_{ik} - b_{ij} b_{jk}$. 

Show that the result is equivalent to multiplying $B$ on the 
left by a sequence of matrices. Explain why, at the end of the 
computation, the first $n$ columns of $B$ are the columns of the 
identity matrix $I_n$, and the last $n$ columns are the columns of 
the inverse matrix $A^{-1}$. Give a condition on the matrix $A$ which 
will ensure that the computation does not break down. 

Show that the process as described requires approximately 
$2n^3$ multiplications, but that, if the multiplications in which one 
of the factors is zero are not counted, the total is approximately 
$n^3$. 

2.6 Use the method of Exercise 5 to find the inverse of the matrix 

\[ A = \begin{pmatrix} 2 & 4 & 2 \\ 1 & 0 & 3 \\ 3 & 1 & 2 \end{pmatrix} . \]
2.7 Suppose that for a matrix $A \in \mathbb{R}^{n\times n}$, 

\[ \sum_{i=1}^{n} |a_{ij}| \le C , \qquad j = 1, 2, \dots, n . \]

Show that, for any vector $x \in \mathbb{R}^{n}$, 

\[ \sum_{i=1}^{n} |(Ax)_i| \le C \|x\|_1 . \]

Find a nonzero vector $x$ for which equality can be achieved, and 
deduce that 

\[ \|A\|_1 = \max_{j=1}^{n} \sum_{i=1}^{n} |a_{ij}| . \]

2.8 (i) Show that, for any vector $v = (v_1, \dots, v_n)^{T} \in \mathbb{R}^{n}$, 

\[ \|v\|_\infty \le \|v\|_2 \quad\text{and}\quad \|v\|_2^2 \le \|v\|_1 \|v\|_\infty . \]

In each case give an example of a nonzero vector $v$ for which 
equality is attained. Deduce that $\|v\|_\infty \le \|v\|_2 \le \|v\|_1$. Show 
also that $\|v\|_2 \le \sqrt{n}\, \|v\|_\infty$. 

(ii) Show that, for any matrix $A \in \mathbb{R}^{m\times n}$, 

\[ \|A\|_\infty \le \sqrt{n}\, \|A\|_2 \quad\text{and}\quad \|A\|_2 \le \sqrt{m}\, \|A\|_\infty . \]

In each case give an example of a matrix $A$ for which equality 
is attained. (See the footnote following Definition 2.12 for the 
meaning of $\|A\|_1$, $\|A\|_2$ and $\|A\|_\infty$ when $A \in \mathbb{R}^{m\times n}$.) 

2.9 Prove that, for any nonsingular matrix $A \in \mathbb{R}^{n\times n}$, 

\[ \kappa_2(A) = \left( \frac{\lambda_n}{\lambda_1} \right)^{1/2} , \]

where $\lambda_1$ is the smallest and $\lambda_n$ is the largest eigenvalue of the 
matrix $A^{T}A$. 

Show that the condition number $\kappa_2(Q)$ of an orthogonal matrix 
$Q$ is equal to 1. Conversely, if $\kappa_2(A) = 1$ for the matrix $A$, 
show that all the eigenvalues of $A^{T}A$ are equal; deduce that $A$ 
is a scalar multiple of an orthogonal matrix. 

2.10 Let $A \in \mathbb{R}^{n\times n}$. Show that if $\lambda$ is an eigenvalue of $A^{T}A$, then 

\[ 0 \le \lambda \le \|A^{T}\| \, \|A\| , \]

provided that the same subordinate matrix norm is used for 
both $A$ and $A^{T}$. Hence show that, for any nonsingular $n \times n$ 
matrix $A$, 

\[ \kappa_2(A) \le \{ \kappa_1(A)\, \kappa_\infty(A) \}^{1/2} . \]

2.11 For the matrix defined by (2.46) write down the matrix $A^{T}A$. 
Show that any vector $x \neq 0$ is an eigenvector of $A^{T}A$ with 
eigenvalue $\lambda = 1$, provided that $x_1 = 0$ and $x_2 + \dots + x_n = 0$. 
Show also that there are two eigenvectors with $x_2 = \dots = x_n$, 
and find the corresponding eigenvalues. Deduce that 

\[ \kappa_2(A) = \tfrac{1}{2}(n+1) \left( 1 + \sqrt{1 - \frac{4}{(n+1)^2}} \right) . \]

2.12 Let $B \in \mathbb{R}^{n\times n}$ and denote by $I$ the identity matrix of order $n$. 
Show that if the matrix $I - B$ is singular, then there exists a 
nonzero vector $x \in \mathbb{R}^{n}$ such that $(I - B)x = 0$; deduce that 
$\|B\| \ge 1$, and hence that, if $\|A\| < 1$, then the matrix $I - A$ is 
nonsingular. 

Now suppose that $A \in \mathbb{R}^{n\times n}$ with $\|A\| < 1$. Show that 

\[ (I - A)^{-1} = I + A (I - A)^{-1} , \]

and hence that 

\[ \|(I - A)^{-1}\| \le 1 + \|A\| \, \|(I - A)^{-1}\| . \]

Deduce that 

\[ \|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|} . \]

2.13 Let $A \in \mathbb{R}^{n\times n}$ be a nonsingular matrix and $b \in \mathbb{R}^{n}_{*}$. Suppose 
that $Ax = b$ and $(A + \delta A)(x + \delta x) = b$, and that $\|A^{-1} \delta A\| < 1$. 
Use the result of Exercise 12 to show that 

\[ \frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1} \delta A\|}{1 - \|A^{-1} \delta A\|} . \]

2.14 Suppose that $A \in \mathbb{R}^{n\times n}$ is a nonsingular matrix, and $b \in \mathbb{R}^{n}_{*}$. 
Given that $Ax = b$ and $A(x + \delta x) = b + \delta b$, Theorem 2.11 
states that 

\[ \frac{\|\delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|} . \]

By considering the eigenvectors of $A^{T}A$, show how to find vectors 
$b$ and $\delta b$ for which equality is attained, when using the 
2-norm. 
2.15 Find the QR factorisation of the matrix 

\[ A = \begin{pmatrix} 9 & -6 \\ 12 & -8 \\ 0 & 20 \end{pmatrix} , \]

and hence find the least squares solution of the system of linear 
equations 

\[ \begin{aligned} 9x - 6y &= 300 , \\ 12x - 8y &= 600 , \\ 20y &= 900 . \end{aligned} \]


3 Special matrices 


3.1 Introduction 


In this chapter we show how one can modify the elimination method for 
the solution of $Ax = b$ when the matrix $A$ has certain special properties. 
In particular when $A \in \mathbb{R}^{n\times n}$ is symmetric and positive definite 
the amount of computational work can be halved. For matrices with a 
band structure, having nonzero elements only in positions close to the 
diagonal, the efficiency can be improved even more dramatically. 


3.2 Symmetric positive definite matrices 


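Definition 3.1 The matrix $A = (a_{ij}) \in \mathbb{R}^{n\times n}$ is said to be symmetric 
if $a_{ij} = a_{ji}$ for all $i$ and $j$ in the set $\{1, 2, \dots, n\}$; i.e., if $A = A^{T}$. The 
set of all symmetric matrices $A \in \mathbb{R}^{n\times n}$ will be denoted by $\mathbb{R}^{n\times n}_{\mathrm{sym}}$. A 
matrix $A \in \mathbb{R}^{n\times n}$ is called positive definite if 

\[ x^{T} A x > 0 \]

for every vector $x \in \mathbb{R}^{n}_{*} = \mathbb{R}^{n} \setminus \{0\}$. 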


Example 3.1 Consider the matrix $A \in \mathbb{R}^{2\times 2}$, 

\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} , \]

and a vector $x = (x_1, x_2)^{T} \in \mathbb{R}^{2}_{*} = \mathbb{R}^{2} \setminus \{0\}$. 

Clearly, $x^{T} A x = a x_1^2 + (b + c) x_1 x_2 + d x_2^2$. The quadratic form on 
the right-hand side is positive for all real numbers $x_1$, $x_2$ such that 
$x = (x_1, x_2)^{T} \neq (0, 0)^{T} = 0$ if, and only if, 

\[ a > 0 , \quad d > 0 \quad\text{and}\quad (b + c)^2 < 4ad . \]

We see that if $A \in \mathbb{R}^{2\times 2}$ is positive definite, then the diagonal elements of 
$A$ are positive. Further, noting that the third inequality can be rewritten 
as 

\[ (b - c)^2 < 4(ad - bc) = 4 \det(A) , \]

we deduce that the determinant of a positive definite matrix $A \in \mathbb{R}^{2\times 2}$ is 
positive. This, of course, is still true in the special case when $A \in \mathbb{R}^{2\times 2}_{\mathrm{sym}}$, 
i.e., when $b = c$. ⋄ 
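
The criterion of Example 3.1 is easy to try out numerically. The following is a minimal sketch, not part of the original text; the function names `is_positive_definite_2x2` and `quadratic_form` are hypothetical and chosen here only for illustration.

```python
def is_positive_definite_2x2(a, b, c, d):
    """Criterion of Example 3.1 for the matrix A = [[a, b], [c, d]]."""
    return a > 0 and d > 0 and (b + c) ** 2 < 4 * a * d


def quadratic_form(a, b, c, d, x1, x2):
    """Evaluate x^T A x = a*x1^2 + (b + c)*x1*x2 + d*x2^2."""
    return a * x1 * x1 + (b + c) * x1 * x2 + d * x2 * x2


# A symmetric matrix that satisfies the criterion.
print(is_positive_definite_2x2(2, -1, -1, 2))
# A symmetric matrix with positive diagonal that fails it;
# the quadratic form is negative at x = (1, -1).
print(is_positive_definite_2x2(1, 3, 3, 1), quadratic_form(1, 3, 3, 1, 1, -1))
```

Note that the criterion involves only the sum b + c, so a nonsymmetric matrix such as a = d = 1, b = 1, c = -1 also passes the test.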


The next theorem extends the observations of the last example to any 
symmetric positive definite matrix $A \in \mathbb{R}^{n\times n}$. 


Theorem 3.1 Suppose that $n \ge 2$ and $A = (a_{ij}) \in \mathbb{R}^{n\times n}_{\mathrm{sym}}$ is positive 
definite; then: 

(i) all the diagonal elements of $A$ are positive, that is, $a_{ii} > 0$, for 
$i = 1, 2, \dots, n$; 

(ii) all the eigenvalues of $A$ are real and positive, and the eigenvectors 
of $A$ belong to $\mathbb{R}^{n}_{*}$; 

(iii) the determinant of $A$ is positive; 

(iv) every submatrix $B$ of $A$ obtained by deleting any set of rows and 
the corresponding set of columns from $A$ is symmetric and positive 
definite; in particular, every leading principal submatrix is 
positive definite; 

(v) $a_{ij}^2 < a_{ii} a_{jj}$ for all $i$ and $j$ in $\{1, 2, \dots, n\}$ such that $i \neq j$; 

(vi) the element of $A$ with largest absolute value lies on the diagonal; 

(vii) if $a$ is the largest of the diagonal elements of $A$, then 

\[ |a_{ij}| \le a \qquad \forall\, i, j \in \{1, 2, \dots, n\} . \]


Proof (i) Consider the vector $x \in \mathbb{R}^{n}$ with only one nonzero element, 
in position $i \in \{1, 2, \dots, n\}$. Since $A$ is positive definite and $x \in \mathbb{R}^{n}_{*}$, it 
follows that $x_i a_{ii} x_i = x^{T} A x > 0$, and therefore $a_{ii} > 0$. 

(ii) Suppose that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and let $x \in \mathbb{C}^{n}_{*} = \mathbb{C}^{n} \setminus \{0\}$ 
denote the associated eigenvector. Further, let $\bar{x}$ denote the vector in 
$\mathbb{C}^{n}_{*}$ whose $i$th element is the complex conjugate of the $i$th element of 
$x$, $i = 1, 2, \dots, n$. As $Ax = \lambda x$, it follows that $\bar{x}^{T} A x = \lambda (\bar{x}^{T} x)$, and 
therefore, using the symmetry of $A$, 

\[ x^{T} A \bar{x} = x^{T} A^{T} \bar{x} = (\bar{x}^{T} A x)^{T} = (\lambda (\bar{x}^{T} x))^{T} = \lambda (x^{T} \bar{x}) . \]

Complex conjugation then yields $\bar{x}^{T} A x = \bar{\lambda} (\bar{x}^{T} x)$, and hence $\lambda (\bar{x}^{T} x) = 
\bar{\lambda} (\bar{x}^{T} x)$. As $x \neq 0$, it follows that $\lambda = \bar{\lambda}$; i.e., $\lambda$ is a real number. 

The fact that the eigenvector associated with $\lambda$ has real elements 
follows by noting that all elements of the singular matrix $A - \lambda I$ are 
real numbers. Therefore, the column vectors of $A - \lambda I$ are linearly 
dependent in $\mathbb{R}^{n}$. Hence there exist $n$ real numbers $x_1, \dots, x_n$, not all 
zero, such that $(A - \lambda I)x = 0$, where $x = (x_1, \dots, x_n)^{T}$. 

Finally, as $Ax = \lambda x$ with $\lambda \in \mathbb{R}$ and $x \in \mathbb{R}^{n}_{*}$, we have that $x^{T} A x = 
\lambda x^{T} x$. Since $\lambda = x^{T} A x / x^{T} x$ and $A$ is positive definite, $\lambda$ is the ratio 
of two positive real numbers and therefore also real and positive. 

(iii) This follows from the fact that the determinant of $A$ is equal to 
the product of its eigenvalues, and the previous result. Indeed, since $A$ 
is symmetric, there exist an orthogonal matrix $X$ and a diagonal matrix 
$\Lambda$, whose diagonal elements are the eigenvalues $\lambda_i$, $i = 1, 2, \dots, n$, of $A$, 
such that $A = X^{T} \Lambda X = X^{-1} \Lambda X$. By the Binet–Cauchy Theorem (see 
Chapter 2, end of Section 2.3), 

\[ \det(A) = \det(X^{-1}) \det(\Lambda) \det(X) = \frac{1}{\det(X)} \det(\Lambda) \det(X) = \det(\Lambda) = \lambda_1 \lambda_2 \cdots \lambda_n > 0 . \]


(iv) Consider the vector $x \in \mathbb{R}^{n}_{*}$ with zeros in the positions corresponding 
to the rows which have been deleted. Then, 

\[ x^{T} A x = y^{T} B y , \]

where $B$ is the submatrix of $A$ containing the rows and columns which 
remain after deletion, and $y$ is the vector consisting of the elements of 
$x$ which were not deleted. Since the expression on the left is positive, 
the same is true of the expression on the right, for all vectors $y$ except 
the zero vector. Therefore $B$ is positive definite. 

(v) By the previous result the $2 \times 2$ submatrix consisting of rows and 
columns $i$ and $j$ of $A$ is positive definite, and its determinant, 
$a_{ii} a_{jj} - a_{ij}^2$, is therefore positive. 

(vi) This follows from the previous result, since it shows that $|a_{ij}|$ 
cannot exceed the greater of $a_{ii}$ and $a_{jj}$. 

(vii) This follows at once from the previous result. 
The converses of two of these results are also true: 

(i) If all the eigenvalues of the symmetric matrix $A \in \mathbb{R}^{n\times n}$ are 
positive, then $A$ is positive definite; 

(ii) If the determinant of each leading principal submatrix of a matrix 
$A \in \mathbb{R}^{n\times n}_{\mathrm{sym}}$ is positive, then $A$ is positive definite. 

The proof of the second result is involved and will not be given here;¹ 
see, however, Example 3.1 for the case of $n = 2$. The proof of the first 
statement, on the other hand, is quite simple and proceeds as follows. 
Since $A \in \mathbb{R}^{n\times n}$ is symmetric, it has a complete set of orthonormal 
eigenvectors $v_1, \dots, v_n$ in $\mathbb{R}^{n}_{*}$, and the corresponding eigenvalues 
$\lambda_1, \dots, \lambda_n$ are all real. Given any vector $x \in \mathbb{R}^{n}_{*}$, it can be expressed as 

\[ x = \sum_{i=1}^{n} \alpha_i v_i , \]

where $\alpha_i \in \mathbb{R}$, $i = 1, 2, \dots, n$, and $\alpha_1^2 + \dots + \alpha_n^2 = x^{T} x > 0$. Since 
$A v_i = \lambda_i v_i$, $i = 1, 2, \dots, n$, it follows that 

\[ Ax = \sum_{i=1}^{n} \alpha_i \lambda_i v_i . \]

As $v_i^{T} v_j = 0$ for $i \neq j$ and $v_i^{T} v_i = 1$, we deduce that 

\[ x^{T} A x = \sum_{i=1}^{n} \lambda_i \alpha_i^2 \ge \Big( \min_{i=1}^{n} \lambda_i \Big) \sum_{i=1}^{n} \alpha_i^2 > 0 , \]

since $\min_{i=1}^{n} \lambda_i > 0$; therefore $A$ is positive definite. 

For a symmetric positive definite matrix $A$ we can now obtain an LU 
factorisation $A = LU$ in which $U = L^{T}$. 


Theorem 3.2 Suppose that $n \ge 2$ and $A \in \mathbb{R}^{n\times n}_{\mathrm{sym}}$ is a positive definite 
matrix; then, there exists a lower triangular matrix $L \in \mathbb{R}^{n\times n}$ such that 

\[ A = L L^{T} . \]

This is known as the Cholesky factorisation² of $A$. 
1 For more details, see R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge 
University Press, 1992, Theorem 7.2.5. 

2 ‘André-Louis Cholesky (1875-1918) was a French military officer involved in 
geodesy and surveying in Crete and North Africa just before World War I. He 
developed the method now named after him to compute solutions to the normal 
equations for some least squares data fitting problems arising in geodesy. His work 
was posthumously published on his behalf in 1924 by a fellow officer, Benoit, in 
the Bulletin Géodésique.’ (Cleve Moler, NA-Digest, February 18, 1990, Volume 
90, Issue 07, http://www.netlib.org/na-digest-html/90/v90n07.html) 

Proof Since $A$ is symmetric and positive definite, all the leading principal 
submatrices of $A$ are positive definite, and hence by Theorem 2.2 the 
usual LU factorisation exists, with 

\[ A = L^{(1)} U^{(1)} , \]

$L^{(1)} \in \mathbb{R}^{n\times n}$ a unit lower triangular and $U^{(1)} \in \mathbb{R}^{n\times n}$ an upper 
triangular matrix. In this factorisation the product of the leading principal 
submatrices of $L^{(1)}$ and $U^{(1)}$ of order $k$ is the leading principal 
submatrix of $A$ of order $k$, $1 \le k \le n$. Since the determinant of this submatrix 
is positive and all the diagonal elements of $L^{(1)}$ are unity, it follows 
that 

\[ u^{(1)}_{11} u^{(1)}_{22} \cdots u^{(1)}_{kk} > 0 , \qquad k = 1, 2, \dots, n . \]

Thus all the diagonal elements of $U^{(1)}$ are positive. If we now define $D$ 
to be the diagonal matrix with elements $d_{ii} = \sqrt{u^{(1)}_{ii}}$, $i = 1, 2, \dots, n$, we 
can write 

\[ A = L^{(1)} U^{(1)} = (L^{(1)} D)(D^{-1} U^{(1)}) = LU , \]

where now $l_{ii} = u_{ii} = \sqrt{u^{(1)}_{ii}}$. The symmetry of the matrix $A$ shows that 

\[ LU = A = A^{T} = U^{T} L^{T} , \]

so that 

\[ U (L^{T})^{-1} = L^{-1} U^{T} . \]

In this equality the left-hand side is upper triangular, and the right-hand 
side is lower triangular, and hence both sides must be diagonal. 
Therefore, $U = D^{*} L^{T}$, where $D^{*}$ is a diagonal matrix; but $U$ and $L^{T}$ 
have the same diagonal elements, so $D^{*} = I$ and $U = L^{T}$. 

The same argument shows that $L$ and $L^{T}$ are unique, except for the 
arbitrary choice of the signs of the square roots in the definition of the 
diagonal matrix $D$. If we make the natural choice, taking all the square 
roots to be positive, then the diagonal elements of $L$ are positive, and 
the factorisation is unique. 
In practice we construct the elements of $L$ directly, rather than forming 
$L^{(1)}$ and $U^{(1)}$ first. This is done in a similar way to the LU factorisation. 
Suppose that $i \le j$; we then require that 

\[ a_{ij} = \sum_{k=1}^{i} l_{ik} l_{jk} , \qquad 1 \le i \le j \le n . \tag{3.1} \]

Note that we have used the fact that $(L^{T})_{kj} = l_{jk}$; the sum only extends 
up to $k = i$ since $L$ is lower triangular. The same equation will also hold 
for $i > j$, since $A$ is symmetric. For $i = j$, equation (3.1) gives 

\[ l_{11} = a_{11}^{1/2} , \qquad l_{ii} = \Big\{ a_{ii} - \sum_{k=1}^{i-1} l_{ik}^2 \Big\}^{1/2} , \qquad 1 < i \le n . \tag{3.2} \]

As $A$ is a positive definite matrix, $a_{11} > 0$ and therefore $l_{11}$ is a positive 
real number. Further, as we have seen in the proof of the preceding 
theorem, $l_{ii} > 0$, $i = 2, 3, \dots, n$. We find similarly that 

\[ l_{ji} = \frac{1}{l_{ii}} \Big\{ a_{ij} - \sum_{k=1}^{i-1} l_{ik} l_{jk} \Big\} , \qquad j = i+1, \dots, n . \tag{3.3} \]

These equations now enable us to calculate the elements of $L$ in succession. 
For each $i \in \{1, 2, \dots, n-1\}$, we first calculate $l_{ii}$ from (3.2), and 
then calculate $l_{i+1,i}, l_{i+2,i}, \dots, l_{ni}$ from (3.3). Finally, we compute $l_{nn}$ 
using (3.2). 

As, by hypothesis, the matrix $A \in \mathbb{R}^{n\times n}_{\mathrm{sym}}$ is positive definite, the 
required factorisation exists, so we can be sure that the divisor $l_{ii}$ in (3.3), 
and the expression in the curly brackets in (3.2) whose square root is 
taken, will be positive. Thus, (3.2) implies that 

\[ l_{11}^2 = a_{11} , \qquad \max_{k=1}^{i-1} l_{ik}^2 \le a_{ii} , \qquad i = 2, 3, \dots, n . \]

The elements of the factor $L$ cannot therefore grow very large, and no 
pivoting is necessary. 

The evaluation of $l_{ii}$ from (3.2) requires $i - 1$ multiplications, $i - 1$ 
subtractions and one square root operation, a total of $2i - 1$ operations. 
The calculation of each $l_{ji}$ from (3.3) also requires $2i - 1$ operations. 
The total number of operations required to construct $L$ is therefore 

\[ \sum_{i=1}^{n} \sum_{j=i}^{n} (2i - 1) = \sum_{i=1}^{n} (2i - 1)(1 + n - i) = \tfrac{1}{6} n (n+1)(2n+1) . \]

For large $n$ the number of operations required is approximately $\tfrac{1}{3} n^3$, 
which, as might be expected, is half the number given in Section 2.6 for 
the LU factorisation of a nonsymmetric matrix. 
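
The column-by-column construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a dense list-of-rows representation; the names `cholesky` and `operation_count` are hypothetical, and the test matrix is chosen here for its integer factor rather than taken from the text.

```python
import math


def cholesky(a):
    """Return lower triangular L (list of rows) with A = L L^T.

    Follows (3.2) for the diagonal and (3.3) for the elements below it.
    `a` is assumed symmetric positive definite.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # (3.2): expression in curly brackets, then its square root.
        s = a[i][i] - sum(low[i][k] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[i][i] = math.sqrt(s)
        # (3.3): remaining elements of column i, below the diagonal.
        for j in range(i + 1, n):
            t = a[i][j] - sum(low[i][k] * low[j][k] for k in range(i))
            low[j][i] = t / low[i][i]
    return low


def operation_count(n):
    """Total of (2i - 1) operations over all elements l_ji with i <= j <= n."""
    return sum((2 * i - 1) * (n + 1 - i) for i in range(1, n + 1))


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
for row in cholesky(A):
    print(row)
print(operation_count(3), 3 * 4 * 7 // 6)
```

The guard on `s` reflects the remark that the bracketed expression in (3.2) is positive whenever A is positive definite; raising an error otherwise is a design choice of this sketch.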


3.3 Tridiagonal and band matrices 


As we shall see in the final chapters, in the numerical solution of bound- 
ary value problems for second-order differential equations one encounters 
a particular kind of matrix whose elements are mostly zeros, except for 
those along its main diagonal and the two adjacent diagonals. Matri- 
ces of this kind are referred to as tridiagonal. In order to motivate the 
definition of tridiagonal matrix stated in Definition 3.2 below, we begin 
with an example which is discussed in more detail in Chapter 13. 


Example 3.2 Consider the two-point boundary value problem 

\[ -\frac{\mathrm{d}^2 y}{\mathrm{d}x^2} + r(x)\, y = f(x) , \qquad x \in (0, 1) , \]
\[ y(0) = 0 , \qquad y(1) = 0 , \]

where $r$ and $f$ are continuous functions of $x$ defined on the interval $[0, 1]$. 

The numerical solution of the boundary value problem proceeds by 
selecting an integer $n \ge 4$, choosing a step size $h = 1/n$, and subdividing 
the interval $[0, 1]$ by the points $x_k = kh$, $k = 0, 1, \dots, n$. The numerical 
approximation to $y(x_k)$, the value of the analytical solution $y$ at the 
point $x = x_k$, is denoted by $Y_k$. The values $Y_k$ are obtained by solving 
the set of linear equations 

\[ -\frac{Y_{k-1} - 2Y_k + Y_{k+1}}{h^2} + r(x_k) Y_k = f(x_k) \]

for $k = 1, 2, \dots, n-1$, together with the boundary conditions 

\[ Y_0 = 0 , \qquad Y_n = 0 . \]

Equivalently, 

\[ a_k Y_{k-1} + c_k Y_k + b_k Y_{k+1} = d_k , \qquad k = 1, 2, \dots, n-1 , \]
\[ Y_0 = 0 , \qquad Y_n = 0 , \]

where 

\[ a_k = b_k = -1/h^2 , \qquad c_k = 2/h^2 + r(x_k) , \qquad d_k = f(x_k) , \]

for $k = 1, 2, \dots, n-1$. 
Clearly, for $1 < k < n-1$, the $k$th equation in the linear system above 
involves only three of the $n - 1$ unknowns: $Y_{k-1}$, $Y_k$ and $Y_{k+1}$. ⋄ 
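
The coefficients of this linear system are straightforward to tabulate. The sketch below is illustrative only: the function name `assemble` and the choice of passing r and f as Python callables are assumptions made here, not part of the text.

```python
def assemble(n, r, f):
    """Return the tuples (a_k, c_k, b_k, d_k) of Example 3.2, k = 1, ..., n-1."""
    h = 1.0 / n
    rows = []
    for k in range(1, n):
        x_k = k * h
        a_k = -1.0 / h**2           # coefficient of Y_{k-1}
        c_k = 2.0 / h**2 + r(x_k)   # coefficient of Y_k
        b_k = -1.0 / h**2           # coefficient of Y_{k+1}
        d_k = f(x_k)                # right-hand side
        rows.append((a_k, c_k, b_k, d_k))
    return rows


# Illustrative data: r(x) = 0 and f(x) = 1 with n = 4 subintervals.
for row in assemble(4, lambda x: 0.0, lambda x: 1.0):
    print(row)
```

Since Y_0 = Y_n = 0, the coefficient a_1 in the first row and b_{n-1} in the last row multiply known zero values and drop out of the system for the unknowns Y_1, ..., Y_{n-1}.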


The example motivates the following definition of a tridiagonal (or 
triple diagonal) matrix. 


Definition 3.2 Suppose that $n \ge 3$. A matrix $T = (t_{ij}) \in \mathbb{R}^{n\times n}$ is said 
to be tridiagonal if it has nonzero elements only on the main diagonal 
and the two adjacent diagonals; i.e., 

\[ t_{ij} = 0 \quad\text{if } |i - j| > 1 , \qquad i, j \in \{1, 2, \dots, n\} . \]

Such matrices are also sometimes called triple diagonal. 


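It is easy to see that in the LU factorisation process of a tridiagonal 
matrix $T \in \mathbb{R}^{n\times n}$, without row interchanges, the unit lower triangular 
matrix $L \in \mathbb{R}^{n\times n}$ and the upper triangular matrix $U \in \mathbb{R}^{n\times n}$ each have 
only two elements in each row. Writing $T$ in the compact notation 

\[ T = \begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & b_{n-1} & c_{n-1} \\ & & & a_n & b_n \end{pmatrix} , \tag{3.4} \]

the factorisation may be written $T = LU$ where 

\[ L = \begin{pmatrix} 1 & & & & \\ l_2 & 1 & & & \\ & l_3 & 1 & & \\ & & \ddots & \ddots & \\ & & & l_n & 1 \end{pmatrix} \tag{3.5} \]

and 

\[ U = \begin{pmatrix} u_1 & v_1 & & & \\ & u_2 & v_2 & & \\ & & u_3 & v_3 & \\ & & & \ddots & \ddots \\ & & & & u_n \end{pmatrix} , \tag{3.6} \]

with the convention that the missing elements in these matrices are all 
equal to zero. It is often convenient to define $a_1 = 0$ and $c_n = 0$. 
Multiplying $L$ and $U$ shows that $v_j = c_j$, and that the elements $l_j$ and 
$u_j$ can be calculated from 

\[ l_j = a_j / u_{j-1} , \qquad u_j = b_j - l_j c_{j-1} , \qquad j = 2, 3, \dots, n , \tag{3.7} \]

starting from $u_1 = b_1$. 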

Let us suppose that our aim is to solve the system of linear equations 
$Tx = r$, where the matrix $T \in \mathbb{R}^{n\times n}$ is tridiagonal and nonsingular, and 
$r \in \mathbb{R}^{n}$. Having calculated the elements of the matrices $L$ and $U$ in the 
LU factorisation $T = LU$ using (3.7), the forward and backsubstitution 
are then also very simple. Letting $y = Ux$, the equation $Ly = r$ gives 

\[ y_1 = r_1 , \tag{3.8} \]
\[ y_j = r_j - l_j y_{j-1} , \qquad j = 2, 3, \dots, n , \tag{3.9} \]

and finally from $Ux = y$ we get 

\[ x_n = y_n / u_n , \tag{3.10} \]
\[ x_j = (y_j - v_j x_{j+1}) / u_j , \qquad j = n-1, n-2, \dots, 1 . \tag{3.11} \]


The LU factorisation of a tridiagonal matrix requires approximately 
3n operations. The forward and backsubstitution together involve ap- 
proximately 5n operations. Thus, the whole solution process requires 
approximately 8n operations. The total amount of work is therefore 
far less than for a full matrix, being of order n for large n, compared 
with $\tfrac{2}{3} n^3$ for a full matrix. The method we have described is a minor 
variation on what is often known as the Thomas algorithm.¹ 
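
The factorisation (3.7) and the substitutions (3.8)-(3.11) translate almost line by line into code. The following is a minimal sketch without pivoting; the function name `thomas_solve` and the 0-based storage of the three diagonals in separate lists are assumptions of this example.

```python
def thomas_solve(a, b, c, r):
    """Solve T x = r for a tridiagonal T stored as in (3.4), without pivoting.

    a[j], b[j], c[j] are the sub-, main- and super-diagonal entries of
    row j (0-based). a[0] and c[-1] are never used, matching the
    convention a_1 = 0 and c_n = 0.
    """
    n = len(b)
    low = [0.0] * n   # multipliers l_j
    u = [0.0] * n     # pivots u_j
    u[0] = b[0]
    for j in range(1, n):                 # factorisation (3.7)
        low[j] = a[j] / u[j - 1]
        u[j] = b[j] - low[j] * c[j - 1]
    y = [0.0] * n
    y[0] = r[0]                           # forward substitution (3.8), (3.9)
    for j in range(1, n):
        y[j] = r[j] - low[j] * y[j - 1]
    x = [0.0] * n
    x[n - 1] = y[n - 1] / u[n - 1]        # back substitution (3.10), (3.11)
    for j in range(n - 2, -1, -1):
        x[j] = (y[j] - c[j] * x[j + 1]) / u[j]   # v_j = c_j
    return x


# Illustrative system with diagonal 2 and off-diagonals -1.
print(thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0],
                   [1.0, 0.0, 1.0]))
```

Each loop body performs a fixed number of arithmetic operations, which is where the linear operation count quoted above comes from. The sketch divides by the pivots without any check, so it presupposes that no pivot vanishes.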

So far we have assumed that pivoting was not necessary; clearly any 
interchange of rows will destroy the tridiagonal structure of T. However, 
it is easy to see that the only interchanges required will be between two 
adjacent rows. 


Theorem 3.3 Suppose that $n \ge 3$ and $T \in \mathbb{R}^{n\times n}$ is a tridiagonal matrix; 
then, there exists a permutation matrix $P \in \mathbb{R}^{n\times n}$ such that 

\[ PT = LU \tag{3.12} \]

where $L \in \mathbb{R}^{n\times n}$ is unit lower triangular with at most two nonzero 
elements in each row, and $U \in \mathbb{R}^{n\times n}$ is upper triangular with at most 
three nonzero elements in each row. 

1 After Llewellyn H. Thomas, a distinguished physicist, who in the 1950s held 
positions at Columbia University and at IBM’s Watson Research Laboratory. He is 
probably best known in connection with the Thomas–Fermi electron gas model. 
The terminology ‘Thomas algorithm’ comes from David Young. Thomas, L.H., 
Elliptic Problems in Linear Difference Equations over a Network, Watson Sci. 
Comput. Lab. Rept, Columbia University, New York, 1949. See NA-Digest V.96, 
09, http://www.netlib.org/cgi-bin/mfs/02/96/v96n09.html 


The proof of this theorem is left as an exercise (see Exercise 6). It 
shows that the effect of pivoting is at worst to lead to an additional 
superdiagonal in the upper triangular factor. 

In an important class of problems it is also easy to show that pivoting 
is unnecessary. We have shown this to be true for a symmetric positive 
definite matrix, and we can now show that it is also true for a tridiagonal 
matrix which is strictly diagonally dominant. 


Definition 3.3 A matrix $A \in \mathbb{R}^{n\times n}$ is said to be diagonally dominant 
if 

\[ |a_{ii}| \ge \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}| , \qquad i = 1, 2, \dots, n . \]

$A$ is said to be strictly diagonally dominant if strict inequality holds 
for each $i$. 


Theorem 3.4 Suppose that $n \ge 3$, $T \in \mathbb{R}^{n\times n}$ is tridiagonal, as in (3.4), 
and 

\[ |b_j| > |a_j| + |c_j| , \qquad j = 1, 2, \dots, n \tag{3.13} \]

(with the convention $a_1 = 0$, $c_n = 0$); then $T$ is nonsingular, and it can 
be written, without pivoting, in the form $T = LU$ where $L \in \mathbb{R}^{n\times n}$ is 
unit lower triangular and $U \in \mathbb{R}^{n\times n}$ is upper triangular. The condition 
(3.13) ensures that the matrix $T$ is strictly diagonally dominant. 


Proof We first show by induction that $|u_j| > |c_j|$ for all $j = 1, 2, \dots, n$. 
This inequality trivially holds for $j = 1$ since 

\[ |u_1| = |b_1| > |a_1| + |c_1| = |c_1| . \]

Now let $j \in \{2, \dots, n\}$ and adopt the inductive hypothesis: 

\[ \mathrm{Hyp}_{j-1} : \quad |u_{k-1}| > |c_{k-1}| \qquad \forall\, k \in \{2, \dots, j\} . \]

(As we have already seen, $\mathrm{Hyp}_1$ is true.) Then, from (3.7) we see that 

\[ |u_j| \ge |b_j| - |a_j| \left| \frac{c_{j-1}}{u_{j-1}} \right| \ge |b_j| - |a_j| > |c_j| \tag{3.14} \]

by the condition of strict diagonal dominance (3.13), which then shows 
that $\mathrm{Hyp}_j$ holds. That completes the inductive step. 

We have thus proved that $|u_j| > |c_j|$ for all $j = 1, 2, \dots, n$. In particular, 
we deduce that $u_j \neq 0$ for all $j \in \{1, 2, \dots, n\}$; hence the LU 
factorisation $T = LU$ defined by (3.7) exists. Further, 

\[ \det(T) = \det(L) \det(U) = \det(U) = u_1 u_2 \cdots u_n \neq 0 , \]

so $T$ is nonsingular. 

The formula (3.7) and the inequalities $|u_j| > |c_j|$, $j = 1, 2, \dots, n$, now 
imply that 

\[ |u_j| \le |b_j| + |l_j| \, |c_{j-1}| = |b_j| + |a_j c_{j-1}| / |u_{j-1}| < |b_j| + |a_j| , \qquad j = 1, 2, \dots, n , \tag{3.15} \]

so the elements $u_j$ cannot grow large, and rounding errors are kept under 
control without pivoting. 


It is easy to see that the same result holds under the weaker assump- 
tion that the matrix is diagonally dominant, but not necessarily strictly 
diagonally dominant, provided that we also require that all the elements 
$c_j$, $j = 1, 2, \dots, n-1$, are nonzero (see Exercise 5). 

Note also that the matrix constructed in Example 3.2 satisfies this 
condition, provided that the function r is nonnegative; this often holds 
in practical boundary value problems. 
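
The hypothesis (3.13) of Theorem 3.4 and the resulting bounds on the pivots can be checked numerically for a given matrix. The sketch below is illustrative only; the helper names `is_strictly_diagonally_dominant` and `pivots`, and the sample diagonals, are assumptions introduced here.

```python
def is_strictly_diagonally_dominant(a, b, c):
    """Check (3.13): |b_j| > |a_j| + |c_j|, with a_1 = 0 and c_n = 0."""
    n = len(b)
    for j in range(n):
        sub = abs(a[j]) if j > 0 else 0.0
        sup = abs(c[j]) if j < n - 1 else 0.0
        if not abs(b[j]) > sub + sup:
            return False
    return True


def pivots(a, b, c):
    """Pivots u_j computed from (3.7), starting from u_1 = b_1."""
    u = [b[0]]
    for j in range(1, len(b)):
        u.append(b[j] - (a[j] / u[j - 1]) * c[j - 1])
    return u


# Illustrative strictly diagonally dominant tridiagonal matrix (0-based rows).
a = [0.0, 1.0, -1.0, 2.0]
b = [4.0, -5.0, 6.0, 3.0]
c = [2.0, 3.0, -4.0, 0.0]
u = pivots(a, b, c)
print(is_strictly_diagonally_dominant(a, b, c))
print(all(abs(u[j]) > abs(c[j]) for j in range(len(u))))
```

For this sample matrix every pivot exceeds the corresponding superdiagonal entry in absolute value, as the induction in the proof predicts.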

If the matrix $T \in \mathbb{R}^{n\times n}$ is symmetric and positive definite, as well as 
tridiagonal, it can be factorised in the form $T = LL^{T}$, where $L \in \mathbb{R}^{n\times n}$ is 
lower triangular with nonzero elements only on and immediately below 
the diagonal. If we use the notation $d_j = l_{jj}$, $e_j = l_{j,j-1}$ we easily find 
from (3.2) and (3.3) that the elements can be calculated in succession 
from the following formulae: 

\[ d_1 = b_1^{1/2} , \]
\[ e_j = c_{j-1} / d_{j-1} , \qquad d_j = (b_j - e_j^2)^{1/2} , \qquad j = 2, 3, \dots, n . \]

This calculation involves about $4n$ operations. Including also the work 
required by the forward and backsubstitution stages, the complete 
solution of $Tx = b$ will be found to involve about $10n$ operations. For 
the tridiagonal matrix the Cholesky factorisation method thus requires 
more work for the complete solution than the Thomas algorithm; in this 
case there is no particular advantage in exploiting the symmetry of the 
matrix in this way. 
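
A short sketch of the tridiagonal Cholesky recurrences follows. It assumes the symmetric matrix is stored by its diagonal and one off-diagonal; the function name `tridiagonal_cholesky` and the sample data are hypothetical choices made for illustration.

```python
import math


def tridiagonal_cholesky(b, c):
    """Factor a symmetric positive definite tridiagonal T as L L^T.

    b holds the diagonal b_1, ..., b_n and c the off-diagonal
    c_1, ..., c_{n-1}. Returns (d, e) with d[j] = l_jj and
    e[j] = l_{j,j-1} (0-based; e[0] is unused and left at zero).
    """
    n = len(b)
    d = [0.0] * n
    e = [0.0] * n
    d[0] = math.sqrt(b[0])
    for j in range(1, n):
        e[j] = c[j - 1] / d[j - 1]
        d[j] = math.sqrt(b[j] - e[j] ** 2)
    return d, e


# Illustrative matrix with diagonal (4, 5, 10) and off-diagonal (2, 6).
d, e = tridiagonal_cholesky([4.0, 5.0, 10.0], [2.0, 6.0])
print(d, e)
```

Each step of the loop uses one division, one multiplication, one subtraction and one square root, in line with the count of about 4n operations.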
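* * * . . . . . . . 
* * * * . . . . . . 
. * * * * . . . . . 
. . * * * * . . . . 
. . . * * * * . . . 
. . . . * * * * . . 
. . . . . * * * * . 
. . . . . . * * * * 
. . . . . . . * * * 
. . . . . . . . * * 

Fig. 3.1. The asterisks indicate the 36 nonzero elements in this 10 x 10 
Band(1,2) matrix; dots mark the zero elements. 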


More generally, a system of equations may often involve a matrix of 
band type. 


Definition 3.4 $B \in \mathbb{R}^{n\times n}$ is a band matrix if there exist nonnegative 
integers $p < n$ and $q < n$ such that $b_{ij} = 0$ for all $i, j \in \{1, 2, \dots, n\}$ 
such that $p < i - j$ or $q < j - i$. The band is of width $p + q + 1$, with 
$p$ elements to the left of the diagonal and $q$ elements to the right of the 
diagonal, in each row. Such a matrix is said to be Band($p$, $q$). 


Thus, for example, a tridiagonal matrix is Band(1,1), and an n x n 
lower triangular matrix is Band(n — 1,0). 

An example of a Band(1,2) matrix $A \in \mathbb{R}^{10\times 10}$ is shown in Figure 3.1, 
where each nonzero element in the matrix is identified by an asterisk. 
In addition to its main diagonal, the matrix has nonzero elements on its 
lower subdiagonal and two of its superdiagonals. 

It is easy to see that, provided that no interchanges are necessary, 
such a band matrix can be written in the form B = LU, where L is 
Band(p,0) and U is Band(0,q) (see Exercise 7). It is also fairly simple to 
count the operations required in this calculation; the result is approx- 
imately proportional to np(p + 2q) when n is moderately large. The 
most common situation has $q = p$, and then the number of operations 
is approximately proportional to $np^2$. As in the tridiagonal case, this is 
much smaller than $n^3$ when $p$ and $q$ are fairly small compared with $n$. 
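
Definition 3.4 can be made concrete with two small helpers. This is an illustrative sketch; the names `band_nonzero_count` and `bandwidths` are hypothetical and introduced only for this example.

```python
def band_nonzero_count(n, p, q):
    """Number of positions (i, j) inside the band of an n x n Band(p, q) matrix."""
    return sum(1 for i in range(1, n + 1) for j in range(1, n + 1)
               if i - j <= p and j - i <= q)


def bandwidths(m):
    """Smallest (p, q) for which the square matrix m is Band(p, q)."""
    n = len(m)
    nonzero = [(i, j) for i in range(n) for j in range(n) if m[i][j] != 0]
    p = max([i - j for i, j in nonzero] + [0])
    q = max([j - i for i, j in nonzero] + [0])
    return p, q


# The 10 x 10 Band(1,2) pattern of Figure 3.1.
print(band_nonzero_count(10, 1, 2))
# A tridiagonal matrix is Band(1,1).
print(bandwidths([[2, -1, 0], [-1, 2, -1], [0, -1, 2]]))
```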


3.4 Monotone matrices 


If a positive real number $a$ is increased by $\varepsilon > 0$ to $a + \varepsilon$, then its 
reciprocal $a^{-1}$ decreases to $(a + \varepsilon)^{-1}$. It is not usually true, however, 
that if we increase some or all of the elements of a nonsingular matrix 
$A \in \mathbb{R}^{n\times n}$, then the elements of the inverse $A^{-1} \in \mathbb{R}^{n\times n}$ will decrease. 
This useful property holds for the class of monotone matrices defined 
below. 

The discussion in this section is not related to Gaussian elimination 
and LU factorisation, but it is of relevance in the iterative solution of 
systems of linear equations with monotone matrices which arise in the 
course of numerical approximation of boundary value problems for cer- 
tain ordinary and partial differential equations. 


Definition 3.5 The nonsingular matrix $A \in \mathbb{R}^{n\times n}$ is said to be monotone 
if all the elements of the inverse $A^{-1}$ are nonnegative. 


Example 3.3 Suppose that $a$ and $d$ are positive real numbers, and $b$ and 
$c$ are nonnegative real numbers such that $ad > bc$. Then, 

\[ A = \begin{pmatrix} a & -b \\ -c & d \end{pmatrix} \]

is a monotone matrix. This is easily seen by considering the inverse of 
the matrix $A$, 

\[ A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & b \\ c & a \end{pmatrix} , \]

and noting that all elements of $A^{-1}$ are nonnegative. 


Next we introduce the concept of ordering in $\mathbb{R}^{n}$ and $\mathbb{R}^{n\times n}$. 


Definition 3.6 For vectors $x$ and $y$ in $\mathbb{R}^{n}$ we use the notation 

\[ x \succeq y \]

to mean that 

\[ x_i \ge y_i , \qquad i = 1, 2, \dots, n . \]

In the same way, for matrices $A$ and $B$ in $\mathbb{R}^{n\times n}$ we write 

\[ A \succeq B \]

to mean that 

\[ a_{ij} \ge b_{ij} , \qquad i, j = 1, 2, \dots, n . \]

The sign $\succeq$ is read ‘succeeds or is equal to’ or, simply, ‘is greater than 
or equal to’. 
Note that, given two arbitrary matrices $A$ and $B$ in $\mathbb{R}^{n\times n}$, in general 
none of $A \succeq B$, $A = B$ and $B \succeq A$ will be true. Therefore the relation 
$\succeq$ is a partial, rather than a total, ordering on $\mathbb{R}^{n\times n}$; the same is true 
of the ordering $\succeq$ on $\mathbb{R}^{n}$. 


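Theorem 3.5 (i) Suppose that the nonsingular matrix $A \in \mathbb{R}^{n\times n}$ is 
monotone, $b, c \in \mathbb{R}^{n}$, and the vectors $x$ and $y$ in $\mathbb{R}^{n}$ are the solutions 
of 

\[ Ax = b \quad\text{and}\quad Ay = c , \]

respectively. If $b \succeq c$, then $x \succeq y$. 

(ii) Suppose that $A$ and $B$ are nonsingular matrices in $\mathbb{R}^{n\times n}$ and that 
both are monotone. If $A \succeq B$, then $B^{-1} \succeq A^{-1}$. 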


Proof (i) Since the elements of $A^{-1}$ are nonnegative and 

\[ x - y = A^{-1}(b - c) , \]

the result follows from the fact that all elements of the vector $A^{-1}(b - c)$ 
appearing on the right-hand side of this equality are nonnegative. 

(ii) Since $A \succeq B$ and all the elements of $B^{-1}$ are nonnegative, it 
follows that 

\[ B^{-1} A \succeq B^{-1} B = I . \]

In the same way, since all the elements of $A^{-1}$ are nonnegative, it follows 
that 

\[ B^{-1} = B^{-1} A A^{-1} \succeq A^{-1} , \]

as required. 
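
Part (ii) of Theorem 3.5 can be illustrated with matrices of the form considered in Example 3.3. The sketch below uses exact rational arithmetic; the function names `inverse_2x2`, `succeeds_or_equals` and `is_monotone`, and the two sample matrices, are assumptions made here for illustration.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Exact inverse of a nonsingular 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]


def succeeds_or_equals(x, y):
    """Elementwise ordering of Definition 3.6 for matrices of equal shape."""
    return all(xi >= yi for rx, ry in zip(x, y) for xi, yi in zip(rx, ry))


def is_monotone(m):
    """Definition 3.5: every element of the inverse is nonnegative."""
    return all(v >= 0 for row in inverse_2x2(m) for v in row)


A = [[2, -1], [-1, 2]]
B = [[1, -1], [-1, 2]]
print(is_monotone(A), is_monotone(B), succeeds_or_equals(A, B))
print(succeeds_or_equals(inverse_2x2(B), inverse_2x2(A)))
```

Here A is obtained from B by increasing a single diagonal element, and the elements of the inverse do not increase as a result.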


The following theorem will be useful in Chapter 13. 
Theorem 3.6 Suppose that $n \ge 3$ and $T \in \mathbb{R}^{n\times n}$ is a tridiagonal 
matrix of the form (3.4) with the properties 

\[ a_i < 0 , \quad i = 2, 3, \dots, n , \qquad c_i < 0 , \quad i = 1, 2, \dots, n-1 , \]

and 

\[ a_i + b_i + c_i > 0 , \qquad i = 1, 2, \dots, n , \]

where we have followed the convention that $a_1 = 0$, $c_n = 0$; then, the 
matrix $T$ is monotone. 
Proof Let $k \in \{1, 2, \dots, n\}$. Column $k$ of the inverse $T^{-1}$ is the solution 
of the linear system $Tx = e^{(k)}$, where $e^{(k)}$ is column $k$ of the identity 
matrix of size $n$, having a single nonzero element, 1, in row $k$. By 
applying the Thomas algorithm to this linear system, it is easy to deduce 
by induction from (3.7) that $l_j < 0$, $u_j > 0$ and $v_j < 0$ for all $j$; the 
argument is very similar to the proof of Theorem 3.4. It then follows 
from (3.8) and (3.9) that, in the notation of the Thomas algorithm, the 
vectors $y$ and $x$ have nonnegative elements. Hence column $k$ of the 
inverse $T^{-1}$ has nonnegative elements. Since the same is true for each 
$k \in \{1, 2, \dots, n\}$, it follows that $T$ is monotone. 


3.5 Notes 


Symmetric systems of linear algebraic equations arise in the numerical 
solution of self-adjoint boundary value problems for differential equa- 
tions with real-valued coefficients. 

For further details on the Cholesky factorisation, the reader may con- 
sult any of the books listed in the Notes at the end of Chapter 2, partic- 
ularly Chapter 10 of N.J. Higham, Accuracy and Stability of Numerical 
Algorithms, SIAM, Philadelphia, 1996. 

Classical iterative methods for the solution of systems of linear equa- 
tions with monotone matrices are discussed, for example, in 


» RICHARD S. VARGA, Matrix Iterative Analysis, Prentice-Hall, 
Englewood Cliffs, NJ, 1962. 


A more recent reference on iterative algorithms for linear systems is 


» OWE AXELSSON, Iterative Solution Methods, Cambridge University 
Press, Cambridge, 1996. 


In particular, Chapter 6 of Axelsson’s book considers the relevance of 
monotone matrices in the context of iterative solution of systems of 
linear equations. 

Theorem 3.6 is a slight variation on the following general result. 


Theorem 3.7 A sufficient condition for $A \in \mathbb{R}^{n\times n}$ to be a monotone 
matrix is that $A$ is an M-matrix, that is, (a) $a_{ij} \le 0$ for all $i, j \in 
\{1, 2, \dots, n\}$ such that $i \neq j$, and (b) there exists a vector $g \in \mathbb{R}^{n}$ with 
positive elements such that all elements of $Ag \in \mathbb{R}^{n}$ are positive. 
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Exercises 


Find the Cholesky factorisation of the matrix 


4 6 2 
A={6 10 8 
2 3 = «5 
Use the method of Cholesky factorisation to solve the system of 
equations 
Ly — 229 Tr 223 => 4, 
—24,+5%2-3%3 = -T, 
2241 —_ 3X2 +r 6x3 = 10. 


3.3 Let $n \ge 3$. The $n \times n$ tridiagonal matrix $T$ has the diagonal
elements

\[ T_{ii} = 2, \qquad i = 1, 2, \dots, n, \]

and the off-diagonal elements

\[ T_{i,i+1} = T_{i+1,i} = -1, \qquad i = 1, 2, \dots, n-1. \]

In the factorisation $T = LU$, where $L \in \mathbb{R}^{n\times n}$ is unit lower
triangular and $U \in \mathbb{R}^{n\times n}$ is upper triangular, show that

\[ L_{i+1,i} = -i/(i+1), \qquad i = 1, 2, \dots, n-1, \]

and find expressions for the elements of $U$. What is the determinant
of $T$?

3.4 Let $n \ge 3$ and $1 \le k \le n$. Define the vector $v^{(k)} \in \mathbb{R}^n$ with
elements given by

\[ v^{(k)}_i = \begin{cases} i(n+1-k), & i = 1, \dots, k, \\ k(n+1-i), & i = k+1, \dots, n. \end{cases} \]

Evaluate $M_{kj}$, the inner product of the vector $v^{(k)}$ with column $j$
of the matrix $T$ defined in Exercise 3. (The inner product $\langle v, w \rangle$
of two vectors $v$ and $w$ in $\mathbb{R}^n$ is defined as the real number
$v^T w$.) Hence give expressions for the elements of the inverse
matrix $T^{-1}$, and verify that this inverse is symmetric. Find the
$\infty$-norm of the inverse, $\|T^{-1}\|_\infty$, and show that the condition
number of $T$ is

\[ \kappa_\infty(T) = \tfrac{1}{2}(n+1)^2, \qquad n \text{ odd}. \]

What is the condition number $\kappa_\infty(T)$ when $n$ is even?


3.5 Given that $n \ge 3$, in the notation of Theorem 3.4 suppose that

\[ |b_j| \ge |a_j| + |c_j|, \qquad j = 1, 2, \dots, n, \]

and

\[ |c_j| > 0, \qquad j = 1, 2, \dots, n-1, \]

with the convention that $a_1 = 0$ and $c_n = 0$. Show that the
factorisation $T = LU$ exists without pivoting, and can be constructed
by the Thomas algorithm. Give an example of a matrix
$T$ which satisfies these conditions, except that $c_k = 0$ for some
$k \in \{1,2,\dots,n-1\}$ and such that $T$ is singular and cannot be
written in the form $T = LU$ without pivoting.

3.6 Let $n \ge 3$ and suppose that the matrix $T \in \mathbb{R}^{n\times n}$ is tridiagonal.
Show that there exists a permutation matrix $P \in \mathbb{R}^{n\times n}$ such
that

\[ PT = LU, \]

where $L \in \mathbb{R}^{n\times n}$ is unit lower triangular with at most two
nonzero elements in each row, and $U \in \mathbb{R}^{n\times n}$ is upper triangular
with at most three nonzero elements in each row.

3.7 Suppose that the matrix $B$ is Band($p,q$), and that there exists
a factorisation $B = LU$ without row interchanges. Show that $L$
is Band($p,0$) and $U$ is Band($0,q$).

3.8 Suppose that $n \ge 4$, that the matrix $A \in \mathbb{R}^{n\times n}$ is Band(3,3),
and has the LU factorisation $A = LU$, so that $L \in \mathbb{R}^{n\times n}$ is
Band(3,0) and $U \in \mathbb{R}^{n\times n}$ is Band(0,3). Suppose also that
$a_{i+2,i} = 0$, $a_{i,i+2} = 0$ for $i = 1,2,\dots,n-2$. By considering
$u_{24}$ and $l_{42}$, or otherwise, show that in general the elements
$l_{i+2,i}$ and $u_{i,i+2}$ are not zero.


4


Simultaneous nonlinear equations 


4.1 Introduction 


In Chapter 1 we discussed iterative methods for the solution of a single 
nonlinear equation of the form $f(x) = 0$ where $f$ is a continuous real-valued
function of a single real variable. In Chapters 2 and 3, on the
other hand, we were concerned with direct (as opposed to iterative) 
methods for systems of linear equations. The purpose of the present 
chapter is to extend the techniques developed in Chapter 1 to systems of 
simultaneous nonlinear equations for functions of several real variables. 
We shall concentrate on two methods: the generalisation of simple itera- 
tion, usually referred to as simultaneous iteration, and Newton’s method. 

Given that $x = (x_1,\dots,x_n)^T \in \mathbb{R}^n$, as in Chapters 2 and 3 we denote
by $\|x\|_\infty$ the $\infty$-norm of $x$ defined by

\[ \|x\|_\infty = \max_{i=1}^{n} |x_i| . \]

Throughout the chapter, $\mathbb{R}^n$ will be thought of as a linear space equipped
with the $\infty$-norm; with only minor alterations all of our results can be
restated in the $p$-norm with $p \in [1,\infty)$ on replacing $\|\cdot\|_\infty$ by $\|\cdot\|_p$
throughout. We begin with some basic definitions which involve the
concept of open ball defined in Section 2.7.

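Let $\xi \in \mathbb{R}^n$; the open ball in $\mathbb{R}^n$ (with respect to the $\infty$-norm) of
radius $\varepsilon > 0$ and centre $\xi$ is defined as the set

\[ B_\varepsilon(\xi) = \{x \in \mathbb{R}^n : \|x - \xi\|_\infty < \varepsilon\}. \]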


A set $D \subset \mathbb{R}^n$ is said to be an open set in $\mathbb{R}^n$ if for every $\xi \in D$ there
exists $\varepsilon = \varepsilon(\xi) > 0$ such that $B_\varepsilon(\xi) \subset D$ (see Figure 4.1). For example,
any open ball in $\mathbb{R}^n$ is an open set in $\mathbb{R}^n$. Given $\xi \in \mathbb{R}^n$, any open set
$N(\xi) \subset \mathbb{R}^n$ containing $\xi$ will be called a neighbourhood of $\xi$; thus,
any open set in $\mathbb{R}^n$ is a neighbourhood of each of its elements.

Fig. 4.1. Open set $D$: for each $\xi \in D$ there exists $\varepsilon = \varepsilon(\xi)$ such that the open
ball $B_\varepsilon(\xi)$ of radius $\varepsilon$ and centre $\xi$ is contained in $D$.

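A set $D \subset \mathbb{R}^n$ is said to be a closed set in $\mathbb{R}^n$ if its complement
$\mathbb{R}^n \setminus D$ is an open set in $\mathbb{R}^n$. For example, the closed ball of radius $\varepsilon > 0$
and centre $\xi$, defined by

\[ \bar{B}_\varepsilon(\xi) = \{x \in \mathbb{R}^n : \|x - \xi\|_\infty \le \varepsilon\}, \]

is a closed set in $\mathbb{R}^n$.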
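A sequence $(x^{(k)}) \subset \mathbb{R}^n$ is called a Cauchy sequence in $\mathbb{R}^n$ if for
any $\varepsilon > 0$ there exists a positive integer $k_0 = k_0(\varepsilon)$ such that

\[ \|x^{(k)} - x^{(m)}\|_\infty < \varepsilon \qquad \forall\, k, m \ge k_0 . \]

We shall make use of the fact that $\mathbb{R}^n$ is complete: that is, if $(x^{(k)})$
is a Cauchy sequence in $\mathbb{R}^n$, then there exists $\xi$ in $\mathbb{R}^n$ such that $(x^{(k)})$
converges to $\xi$; i.e.,

\[ \lim_{k\to\infty} \|x^{(k)} - \xi\|_\infty = 0. \tag{4.1} \]

For the sake of brevity, we shall write $\lim_{k\to\infty} x^{(k)} = \xi$ instead of (4.1).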


Lemma 4.1 Suppose that $D$ is a nonempty closed subset of $\mathbb{R}^n$ and
$(x^{(k)}) \subset D$ is a Cauchy sequence in $\mathbb{R}^n$. Then, $\lim_{k\to\infty} x^{(k)} = \xi$ exists
and $\xi \in D$.


Proof As $(x^{(k)})$ is a Cauchy sequence in $\mathbb{R}^n$, there exists $\xi \in \mathbb{R}^n$ such
that $\lim_{k\to\infty} x^{(k)} = \xi$. It remains to prove that $\xi \in D$. Suppose,
otherwise, that $\xi$ belongs to the open set $\mathbb{R}^n \setminus D$. Then, there exists
$\varepsilon > 0$ such that $B_\varepsilon(\xi) \subset \mathbb{R}^n \setminus D$. As $(x^{(k)}) \subset D$, no member of the
sequence $(x^{(k)})$ can enter $B_\varepsilon(\xi)$. This, however, contradicts the fact
that $(x^{(k)})$ converges to $\xi$. The contradiction implies that $\xi \in D$.


Suppose that $D$ is a nonempty subset of $\mathbb{R}^n$ and $f\colon D\,(\subset \mathbb{R}^n) \to \mathbb{R}^n$
is a function defined on $D$. Given that $\xi \in D$, we shall say that $f$ is
continuous at $\xi$ if for every $\varepsilon > 0$ there exists $\delta = \delta(\varepsilon) > 0$ such that,
for every $x \in B_\delta(\xi) \cap D$,

\[ \|f(x) - f(\xi)\|_\infty < \varepsilon. \]

When a function $f$, defined on the set $D$, is continuous at each point of
$D$, it is said to be a continuous function on $D$.


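Lemma 4.2 Let $D$ be a nonempty subset of $\mathbb{R}^n$ and $f\colon D\,(\subset \mathbb{R}^n) \to \mathbb{R}^n$
a function, defined and continuous on $D$. If $(x^{(k)}) \subset D$ converges in $\mathbb{R}^n$
to $\xi \in D$, then $\lim_{k\to\infty} f(x^{(k)}) = f(\xi)$.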


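Proof Due to the continuity of $f$ at $\xi \in D$, given $\varepsilon > 0$, there exists
$\delta = \delta(\varepsilon) > 0$ such that if $\|x - \xi\|_\infty < \delta$ for some $x \in D$, then

\[ \|f(x) - f(\xi)\|_\infty < \varepsilon. \tag{4.2} \]

Further, as $(x^{(k)})$ converges to $\xi$, there exists $k_0 = k_0(\delta) = k_0(\delta(\varepsilon))$ such
that

\[ \|x^{(k)} - \xi\|_\infty < \delta \qquad \forall\, k \ge k_0 . \]

Hence, taking $x = x^{(k)}$ in (4.2), we deduce that for each $\varepsilon > 0$ there
exists $k_0$ such that

\[ \|f(x^{(k)}) - f(\xi)\|_\infty < \varepsilon \qquad \forall\, k \ge k_0 , \]

which means that $\lim_{k\to\infty} f(x^{(k)}) = f(\xi)$.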


After this brief preparation, we are ready to embark on the develop- 
ment of numerical algorithms for the solution of systems of simultaneous 
nonlinear equations. 


4.2 Simultaneous iteration 


Let $D$ be a nonempty closed subset of $\mathbb{R}^n$ and $f\colon D\,(\subset \mathbb{R}^n) \to \mathbb{R}^n$
a continuous function defined on $D$. We shall be concerned with the
problem of finding $\xi \in D$ such that $f(\xi) = 0$. If such $\xi$ exists, it is
called a solution to the equation $f(x) = 0$ (in $D$). When written in
componentwise form, $f(x) = 0$ becomes

\[ f_i(x_1, \dots, x_n) = 0, \qquad i = 1, \dots, n, \]

a system of $n$ simultaneous nonlinear equations for $n$ unknowns, where
$f_1, \dots, f_n$ are the components of $f$.

Fig. 4.2. Graphs of the curves $x_1^2 + x_2^2 - 1 = 0$ and $5x_1^2 + 21x_2^2 - 9 = 0$.


Example 4.1 Consider the system of two simultaneous nonlinear equations
in two unknowns, $x_1$ and $x_2$, defined by

\[ \begin{aligned}
 x_1^2 + x_2^2 - 1 &= 0, \\
 5x_1^2 + 21x_2^2 - 9 &= 0.
\end{aligned} \]

Here $x = (x_1, x_2)^T$ and $f = (f_1, f_2)^T$ with

\[ \begin{aligned}
 f_1(x_1, x_2) &= x_1^2 + x_2^2 - 1, \\
 f_2(x_1, x_2) &= 5x_1^2 + 21x_2^2 - 9.
\end{aligned} \]

The equation $f(x) = 0$ has four solutions:

\[ \begin{aligned}
 \xi_1 &= (-\sqrt{3}/2,\, 1/2)^T, & \xi_2 &= (\sqrt{3}/2,\, 1/2)^T, \\
 \xi_3 &= (-\sqrt{3}/2,\, -1/2)^T, & \xi_4 &= (\sqrt{3}/2,\, -1/2)^T.
\end{aligned} \]

The curves $f_1(x_1, x_2) = 0$ and $f_2(x_1, x_2) = 0$ are depicted in Figure 4.2.
The four solutions correspond to the four points of intersection of the
two curves in the figure.


Example 4.2 Let us suppose that $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$. On letting
$f(x) = b - Ax$ we deduce that the problem of solving the system of
simultaneous linear equations considered in Chapters 2 and 3 can be
restated in the form: find $x \in \mathbb{R}^n$ such that $f(x) = 0$.


Let us assume that we have transformed the equation $f(x) = 0$ into an
equivalent form $g(x) = x$, where $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is a continuous function,
defined on the closed subset $D \subset \mathbb{R}^n$, such that $g(D) \subset D$. For example,
one can choose $g(x) = x - \alpha f(x)$, with $\alpha \in \mathbb{R}$ a suitable nonzero parameter.
By ‘equivalent’ we mean that $\xi \in D$ satisfies $f(\xi) = 0$ if, and only if,
$g(\xi) = \xi$. Any $\xi \in D$ such that $g(\xi) = \xi$ is called a fixed point of the
function $g$ in $D$. Thus the problem of finding a solution $\xi \in D$ to the
equation $f(x) = 0$ has been converted into one of finding a fixed point
in $D$ of the function $g$. We embark on the latter task by considering the
natural extension to $\mathbb{R}^n$ of the simple iteration discussed in Section 1.2
for the solution of the scalar nonlinear equation $g(x) = x$.


Definition 4.1 Suppose that $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is a function, defined and
continuous on a closed subset $D$ of $\mathbb{R}^n$, such that $g(D) \subset D$. Given that
$x^{(0)} \in D$, the recursion defined by

\[ x^{(k+1)} = g(x^{(k)}), \qquad k = 0, 1, 2, \dots, \tag{4.3} \]

is called a simultaneous iteration. For $n = 1$ the recursion (4.3) is
just the simple iteration considered in (1.3).


Note that here we use the superscript $k$ as the sequence index; following
the convention adopted in Chapters 2 and 3, we reserve subscripts
for labelling the entries of vectors. Thus $x_i^{(k)}$ is entry $i$ of the vector
$x^{(k)}$, the $k$th member of the sequence $(x^{(k)})$. The motivation behind
the definition of the simultaneous iteration (4.3) is, of course, our hope
that, under suitable conditions on $g$ and $D$, the sequence $(x^{(k)})$ will
converge to a fixed point $\xi$ of $g$.

Two remarks are in order at this point. First, it is easy to show that if
a sequence of vectors $(x^{(k)})$ converges in $\mathbb{R}^n$ to $\xi$ in the norm $\|\cdot\|_\infty$, then
it also converges to this same limit in the norm $\|\cdot\|_p$ for any $p \in [1, \infty)$.
To see this, note that

\[ \|w\|_\infty \le \|w\|_p \le n^{1/p}\|w\|_\infty \qquad \forall\, w \in \mathbb{R}^n, \tag{4.4} \]

for $1 \le p < \infty$, and take $w = x^{(k)} - \xi$ to deduce that, as $k \to \infty$,
convergence in the $\infty$-norm implies convergence in the $p$-norm for any
$p \in [1,\infty)$, and vice versa. Thus, in this sense, the choice of norm on
$\mathbb{R}^n$ is irrelevant. Second, the assumption that $D$ is a closed set is crucial
in our discussion. If $D$ is not closed, $g\colon D \to D$ need not have a fixed
point in $D$, even if $x^{(k)} \in D$ for all $k \ge 0$ and $(x^{(k)})$ converges in $\mathbb{R}^n$.
We verify this claim through a simple example.


Example 4.3 Suppose that $D$ is the open unit disc in $\mathbb{R}^2$ in the $\infty$-norm,
which is just the square defined by $-1 < x_1 < 1$, $-1 < x_2 < 1$. Consider
the simultaneous iteration defined by (4.3), where $x^{(0)} = 0 \in D$, and

\[ g(x) = \tfrac{1}{2}(x + u), \qquad u = (1, 1)^T . \]

If $\|x\|_\infty < 1$ it is easy to see that $\|g(x)\|_\infty < 1$; hence, starting the
iteration $x^{(k+1)} = g(x^{(k)})$ from $x^{(0)} = 0$, it follows that $x^{(k)} \in D$ for all
$k \ge 0$. The definition of $g$ implies at once that

\[ x^{(k+1)} - u = \tfrac{1}{2}(x^{(k)} - u), \]

and therefore

\[ \|x^{(k+1)} - u\|_\infty = \tfrac{1}{2}\|x^{(k)} - u\|_\infty = \dots = \left(\tfrac{1}{2}\right)^{k+1} \|x^{(0)} - u\|_\infty = \left(\tfrac{1}{2}\right)^{k+1}, \]

from which it is obvious that the sequence $(x^{(k)})$ converges in $\mathbb{R}^2$ to the
limit $u$. However, $u \notin D$, since $u$ lies on the unit circle in the $\infty$-norm
that represents the boundary of the open set $D$.
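
The behaviour described in Example 4.3 is easy to observe numerically. The following Python sketch is ours, not the book's; it assumes $u = (1,1)^T$, although any $u$ with $\|u\|_\infty = 1$ would behave in the same way. It tracks the distance $\|x^{(k)} - u\|_\infty$ and checks that every iterate stays inside the open square $D$.

```python
u = (1.0, 1.0)  # assumption: a vector on the boundary of D, with max-norm 1


def g(x):
    """The map g(x) = (x + u)/2 from the example."""
    return tuple(0.5 * (xi + ui) for xi, ui in zip(x, u))


def norm_inf(w):
    """Infinity norm of a vector given as a tuple."""
    return max(abs(wi) for wi in w)


x = (0.0, 0.0)          # starting value x^(0) = 0
distances = []          # distances[k] = || x^(k+1) - u ||_inf
stays_in_D = True
for k in range(20):
    x = g(x)
    stays_in_D = stays_in_D and norm_inf(x) < 1.0
    distances.append(norm_inf(tuple(xi - ui for xi, ui in zip(x, u))))

print(stays_in_D, distances[0], distances[-1])
```

The distances halve at every step, so the iterates approach $u$ while never leaving $D$; the limit itself has norm exactly 1 and so is not in $D$.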


Up to now we have been assuming that the function $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is
defined and continuous on a closed subset $D$ of $\mathbb{R}^n$. In order to ensure
that $g$ has a (unique) fixed point in $D$, we strengthen our hypotheses on
the function $g$.


Definition 4.2 Suppose that $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is defined on a closed subset
$D$ of $\mathbb{R}^n$. If there exists a positive constant $L$ such that

\[ \|g(x) - g(y)\|_\infty \le L\|x - y\|_\infty \tag{4.5} \]

for all $x$ and $y$ in $D$, then we say that $g$ satisfies a Lipschitz condition
on $D$ in the $\infty$-norm. The number $L$ is called a Lipschitz constant
for $g$ in the $\infty$-norm. In particular, if $L \in (0,1)$, then $g$ is said to be a
contraction on $D$ in the $\infty$-norm.


Any function $g$ that satisfies a Lipschitz condition on a set $D$ is continuous
on $D$. For let $x_0 \in D$ and $\varepsilon > 0$; then, on defining $\delta = \varepsilon/L$, we
deduce from (4.5) that if $\|x - x_0\|_\infty < \delta$ for some $x \in D$, then

\[ \|g(x) - g(x_0)\|_\infty \le L\|x - x_0\|_\infty < \varepsilon. \]

It follows from (4.4) that if $g$ satisfies a Lipschitz condition on $D$ in
the $\infty$-norm then it also does so in the $p$-norm for any $p \in [1,\infty)$, and
vice versa. However, in general, the size of the constant $L$ may depend
on the choice of norm. Specifically, if $g$ is a contraction on a set $D$ in the
$\infty$-norm (i.e., (4.5) holds with $L < 1$), then $g$ need not be a contraction
in the $p$-norm, unless $L < n^{-1/p}$. (See Exercise 1.) Conversely, if $g$ is a
contraction on $D$ in the $p$-norm for some $p \in [1, \infty)$, it does not follow
that $g$ is a contraction on $D$ in the $\infty$-norm.

For example, suppose that $g\colon \mathbb{R}^2 \to \mathbb{R}^2$ is the linear function defined
by $g(x) = Ax$, where $A$ is the $2 \times 2$ matrix

\[ A = \begin{pmatrix} 3/4 & 1/3 \\ 0 & 3/4 \end{pmatrix}. \]

This function $g$ satisfies a Lipschitz condition on $\mathbb{R}^2$ in $\|\cdot\|_p$ for any
$p \in [1, \infty]$, and if $L$ is a Lipschitz constant for $g$ in the $p$-norm, then
$L \ge \|A\|_p$, in the subordinate matrix norm. It is easy to see that $\|A\|_1 =
\|A\|_\infty = 13/12$, and a small calculation gives $\|A\|_2 = 0.935$ to three
decimal digits. Hence the function $g$ is a contraction in the 2-norm, but
not in the 1- or $\infty$-norm.
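
The three norms quoted above can be checked with a few lines of Python. This sketch is ours and assumes that $A$ has rows $(3/4, 1/3)$ and $(0, 3/4)$; it uses the facts that the 1-norm is the largest column sum, the $\infty$-norm is the largest row sum, and the 2-norm is the square root of the largest eigenvalue of $A^T A$, which for a $2 \times 2$ symmetric matrix is available in closed form.

```python
import math

# Assumed entries of A: rows (3/4, 1/3) and (0, 3/4).
A = [[3.0 / 4.0, 1.0 / 3.0],
     [0.0, 3.0 / 4.0]]

norm_1 = max(abs(A[0][j]) + abs(A[1][j]) for j in range(2))    # largest column sum
norm_inf = max(abs(A[i][0]) + abs(A[i][1]) for i in range(2))  # largest row sum

# Entries of the symmetric matrix A^T A = [[p, q], [q, r]].
p = A[0][0] ** 2 + A[1][0] ** 2
q = A[0][0] * A[0][1] + A[1][0] * A[1][1]
r = A[0][1] ** 2 + A[1][1] ** 2
# Largest eigenvalue of a symmetric 2 x 2 matrix in closed form.
largest_eigenvalue = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q ** 2)
norm_2 = math.sqrt(largest_eigenvalue)

print(norm_1, norm_inf, round(norm_2, 3))
```

Since `norm_2` is below 1 while the other two norms exceed 1, the sketch agrees with the conclusion that $g$ is a contraction in the 2-norm only.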

Our next result is a direct generalisation of Theorem 1.3 formulated 
in Chapter 1. 


Theorem 4.1 (Contraction Mapping Theorem) Suppose that $D$
is a closed subset of $\mathbb{R}^n$, $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is defined on $D$, and $g(D) \subset D$.
Suppose further that $g$ is a contraction on $D$ in the $\infty$-norm. Then, $g$
has a unique fixed point $\xi$ in $D$, and the sequence $(x^{(k)})$ defined by (4.3)
converges to $\xi$ for any starting value $x^{(0)} \in D$.


Proof Assuming that $g$ has a fixed point $\xi$ in $D$, the uniqueness of the
fixed point is easy to show: for suppose that $\eta$ is also a fixed point of $g$
in $D$. Then, by (4.5),

\[ \|\xi - \eta\|_\infty = \|g(\xi) - g(\eta)\|_\infty \le L\|\xi - \eta\|_\infty ; \]

i.e., $(1 - L)\|\xi - \eta\|_\infty \le 0$. Since $L \in (0,1)$, and $\|\cdot\|_\infty$ is a norm, it
follows that $\xi - \eta = 0$, and hence $\xi = \eta$. Consequently, if $g$ has a fixed
point in $D$, then this is the unique fixed point of $g$ in $D$.

Now, still assuming that $g$ possesses a fixed point $\xi \in D$, we shall
show that the sequence $(x^{(k)})$ defined by (4.3) converges to $\xi$ for any
starting value $x^{(0)} \in D$. By repeating the argument from Chapter 1
which led to (1.10), with the absolute value sign $|\cdot|$ replaced by $\|\cdot\|_\infty$
throughout, we find that

\[ \|x^{(k)} - \xi\|_\infty \le \frac{L^k}{1-L}\,\|x^{(1)} - x^{(0)}\|_\infty . \]

As $L \in (0,1)$, we deduce that $\lim_{k\to\infty} L^k = 0$, and hence,

\[ \lim_{k\to\infty} \|x^{(k)} - \xi\|_\infty = 0, \]

showing that the sequence $(x^{(k)})$ defined by (4.3) converges to $\xi$ for any
starting value $x^{(0)} \in D$. In particular, if $\varepsilon > 0$, then letting

\[ k_0 = k_0(\varepsilon) = \left[ \frac{\ln \|x^{(1)} - x^{(0)}\|_\infty - \ln(\varepsilon(1 - L))}{\ln(1/L)} \right] + 1, \tag{4.6} \]

we find that

\[ \frac{L^k}{1-L}\,\|x^{(1)} - x^{(0)}\|_\infty < \varepsilon \]

for all $k \ge k_0(\varepsilon)$, and therefore

\[ \|x^{(k)} - \xi\|_\infty \le \varepsilon, \tag{4.7} \]

for all $k \ge k_0(\varepsilon)$, as in Chapter 1. A brief comment on the notation: in
(4.6), $[x]$ denotes the integer part of the real number $x$; i.e., $[x]$ is the
largest integer such that $[x] \le x$, just as in Theorem 1.4.

In order to complete the proof of the theorem, it remains to show
the existence of a fixed point $\xi \in D$ for $g$. In contrast with the proof
of existence of a fixed point for a real-valued function of a single real
variable presented in Chapter 1, here we cannot rely on the Intermediate
Value Theorem (unless, of course, $n = 1$), so we shall develop a different
argument. The essence of this will be to show that $(x^{(k)}) \subset D$ is a
Cauchy sequence in $\mathbb{R}^n$; for then we can apply Lemmas 4.1 and 4.2 to
deduce that the sequence converges to a fixed point $\xi$ of the function $g$.

Let us begin by noting that since $g(D) \subset D$, if $x^{(0)}$ belongs to $D$, then
$x^{(k)} = g(x^{(k-1)}) \in D$ for all $k \ge 1$. Further, since $g$ is a contraction on
$D$ in the $\infty$-norm, we have that

\[ \|x^{(k)} - x^{(k-1)}\|_\infty = \|g(x^{(k-1)}) - g(x^{(k-2)})\|_\infty \le L\|x^{(k-1)} - x^{(k-2)}\|_\infty \]

for all $k \ge 2$. We then deduce by induction that

\[ \|x^{(k)} - x^{(k-1)}\|_\infty \le L^{k-1}\|x^{(1)} - x^{(0)}\|_\infty, \qquad k \ge 1. \tag{4.8} \]


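Suppose that $m$ and $k$ are positive integers and $m \ge k+1$. Then, by
repeated application of the triangle inequality in the $\infty$-norm and using
(4.8), we have that

\[ \begin{aligned}
 \|x^{(m)} - x^{(k)}\|_\infty
 &= \|(x^{(m)} - x^{(m-1)}) + \dots + (x^{(k+1)} - x^{(k)})\|_\infty \\
 &\le \|x^{(m)} - x^{(m-1)}\|_\infty + \dots + \|x^{(k+1)} - x^{(k)}\|_\infty \\
 &\le (L^{m-1} + \dots + L^{k})\,\|x^{(1)} - x^{(0)}\|_\infty \\
 &= L^{k}(L^{m-k-1} + \dots + 1)\,\|x^{(1)} - x^{(0)}\|_\infty \\
 &\le \frac{L^k}{1-L}\,\|x^{(1)} - x^{(0)}\|_\infty ,
\end{aligned} \tag{4.9} \]

where, in the transition to the last line, we made use of the fact that the
geometric series $1 + L + L^2 + \cdots$, with $L \in (0,1)$, sums to $1/(1 - L)$.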

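As $\lim_{k\to\infty} L^k = 0$, it follows from (4.9) that $(x^{(k)})$ is a Cauchy sequence
in $\mathbb{R}^n$; that is, for each $\varepsilon > 0$ there exists $k_0 = k_0(\varepsilon)$ (defined by
(4.6) above) such that

\[ \|x^{(m)} - x^{(k)}\|_\infty < \varepsilon \qquad \forall\, m, k \ge k_0 = k_0(\varepsilon). \tag{4.10} \]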


Any Cauchy sequence in $\mathbb{R}^n$ is convergent in $\mathbb{R}^n$; consequently, there
exists $\xi \in \mathbb{R}^n$ such that $\xi = \lim_{k\to\infty} x^{(k)}$. Further, since $g$ satisfies
a Lipschitz condition on $D$, the discussion in the paragraph following
Definition 4.2 shows that $g$ is continuous on $D$. Hence, by Lemma 4.2,

\[ \xi = \lim_{k\to\infty} x^{(k+1)} = \lim_{k\to\infty} g(x^{(k)}) = g\Bigl(\lim_{k\to\infty} x^{(k)}\Bigr) = g(\xi), \]

which proves that $\xi$ is a fixed point of $g$.
It remains to show that $\xi \in D$. This follows from Lemma 4.1 since
$(x^{(k)}) \subset D$, $\xi = \lim_{k\to\infty} x^{(k)}$ and $D$ is closed.


As a byproduct of the proof, we deduce from (4.7) that, given a positive
tolerance $\varepsilon$, one can compute an approximation $x^{(k)}$ to the unknown
solution $\xi$ using (4.3) in no more than $k_0 = k_0(\varepsilon)$ iterations so that the
approximation error $\xi - x^{(k)}$, measured in the $\infty$-norm, is no greater than $\varepsilon$;
the integer $k_0(\varepsilon)$ is defined by (4.6).
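
To make the a priori bound concrete, the following Python sketch evaluates $k_0(\varepsilon)$ from (4.6). It is an illustration only: the function name `iterations_needed` and the sample values $L = 0.75$, $\|x^{(1)} - x^{(0)}\|_\infty = 0.5$ and $\varepsilon = 10^{-5}$ are hypothetical choices of ours.

```python
import math


def iterations_needed(L, first_step, eps):
    """Evaluate k0(eps) from (4.6).

    L          -- contraction constant, 0 < L < 1
    first_step -- the norm || x^(1) - x^(0) ||_inf (must be positive)
    eps        -- required tolerance
    """
    ratio = (math.log(first_step) - math.log(eps * (1.0 - L))) / math.log(1.0 / L)
    # [x] in (4.6) is the largest integer not exceeding x, i.e. math.floor.
    return math.floor(ratio) + 1


# Hypothetical sample values.
L = 0.75
first_step = 0.5
eps = 1e-5
k0 = iterations_needed(L, first_step, eps)
bound = L ** k0 / (1.0 - L) * first_step   # right-hand side of the error bound
print(k0, bound)
```

The bound is typically pessimistic, since it uses only the contraction constant and the size of the first step; the observed error may fall below the tolerance well before $k_0$ iterations.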

The next theorem relates the constant $L$ from the Lipschitz condition
(4.5) to the partial derivatives of $g$, giving a more practically useful
sufficient condition for convergence.


Definition 4.3 Let $g = (g_1, \dots, g_n)^T \colon \mathbb{R}^n \to \mathbb{R}^n$ be a function defined
and continuous in an (open) neighbourhood $N(\xi)$ of $\xi \in \mathbb{R}^n$. Suppose
further that the first partial derivatives $\partial g_i/\partial x_j$, $j = 1, \dots, n$, of $g_i$ exist at
$\xi$ for $i = 1, \dots, n$. The Jacobian matrix $J_g(\xi)$ of $g$ at $\xi$ is the $n \times n$
matrix with elements

\[ J_g(\xi)_{ij} = \frac{\partial g_i}{\partial x_j}(\xi), \qquad i, j = 1, \dots, n. \]


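Theorem 4.2 Suppose that $g = (g_1, \dots, g_n)^T \colon \mathbb{R}^n \to \mathbb{R}^n$ is defined
and continuous on a closed set $D \subset \mathbb{R}^n$. Let $\xi \in D$ be a fixed point of
$g$, and suppose that the first partial derivatives $\partial g_i/\partial x_j$, $j = 1, \dots, n$, of $g_i$,
$i = 1, \dots, n$, are defined and continuous in some (open) neighbourhood
$N(\xi) \subset D$ of $\xi$, with

\[ \|J_g(\xi)\|_\infty < 1. \]

Then, there exists $\varepsilon > 0$ such that $g(\bar{B}_\varepsilon(\xi)) \subset \bar{B}_\varepsilon(\xi)$, and the sequence
defined by (4.3) converges to $\xi$ for all $x^{(0)} \in \bar{B}_\varepsilon(\xi)$.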


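Proof The proof is a natural extension of that of Theorem 1.5. We
write $K = \|J_g(\xi)\|_\infty$. Since the partial derivatives $\partial g_i/\partial x_j$, $i, j = 1, \dots, n$,
are continuous in the neighbourhood $N(\xi)$ of $\xi$, we can find a closed ball
$\bar{B}_\varepsilon(\xi) \subset N(\xi) \subset D$ of radius $\varepsilon$ and centre $\xi$ such that

\[ \|J_g(z)\|_\infty \le \tfrac{1}{2}(K + 1) < 1 \qquad \forall\, z \in \bar{B}_\varepsilon(\xi). \tag{4.11} \]

Now, suppose that $x$ and $y$ are both in $\bar{B}_\varepsilon(\xi)$ and, for $i \in \{1, \dots, n\}$
fixed, define the function $t \mapsto \varphi_i(t)$ of the single variable $t \in [0, 1]$ by

\[ \varphi_i(t) = g_i(tx + (1-t)y); \]

thus, $\varphi_i(0) = g_i(y)$ and $\varphi_i(1) = g_i(x)$. The function $t \mapsto \varphi_i(t)$ has a
continuous derivative in $t$ on the interval $[0,1]$; thus, by the Mean Value
Theorem (Theorem A.3), there exists $\eta \in (0,1)$ such that

\[ g_i(x) - g_i(y) = \varphi_i(1) - \varphi_i(0) = \varphi_i'(\eta)(1 - 0) = \varphi_i'(\eta). \]

This means that

\[ g_i(x) - g_i(y) = \sum_{j=1}^{n} \frac{\partial g_i}{\partial x_j}(\eta x + (1-\eta)y)\,(x_j - y_j) \tag{4.12} \]

for $i = 1, \dots, n$. Now $|x_j - y_j| \le \|x - y\|_\infty$ for all $j \in \{1, \dots, n\}$, and so
(4.12) gives

\[ \begin{aligned}
 |g_i(x) - g_i(y)|
 &\le \|x - y\|_\infty \sum_{j=1}^{n} \left| \frac{\partial g_i}{\partial x_j}(\eta x + (1-\eta)y) \right| \\
 &\le \|x - y\|_\infty \, \|J_g(\eta x + (1-\eta)y)\|_\infty ,
\end{aligned} \]

for all $i = 1, \dots, n$. Consequently, for any $x, y \in \bar{B}_\varepsilon(\xi)$,

\[ \begin{aligned}
 \|g(x) - g(y)\|_\infty
 &\le \max_{t \in [0,1]} \|J_g(tx + (1-t)y)\|_\infty \, \|x - y\|_\infty \\
 &\le \tfrac{1}{2}(1 + K)\,\|x - y\|_\infty ,
\end{aligned} \tag{4.13} \]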


due to (4.11), given that $tx + (1-t)y \in \bar{B}_\varepsilon(\xi)$ for all $t \in [0,1]$. It follows
that $g$ satisfies a Lipschitz condition (4.5), in the $\infty$-norm, on the closed
ball $\bar{B}_\varepsilon(\xi)$ with $L = \tfrac{1}{2}(1 + K) < 1$. Furthermore, on selecting $y = \xi$ in
(4.13) we get that

\[ \|g(x) - \xi\|_\infty = \|g(x) - g(\xi)\|_\infty \le \|x - \xi\|_\infty \le \varepsilon \]

for all $x \in \bar{B}_\varepsilon(\xi)$. Hence, $g(\bar{B}_\varepsilon(\xi)) \subset \bar{B}_\varepsilon(\xi)$. The convergence of the
iteration (4.3) to $\xi$, for an arbitrary starting value $x^{(0)} \in \bar{B}_\varepsilon(\xi)$, now
follows from Theorem 4.1.


We close this section with an example which illustrates the application 
of the method of simultaneous iteration to the solution of a system of 
nonlinear equations. 


Example 4.4 Let us consider, as in Example 4.1, the system of two
simultaneous nonlinear equations in the unknowns $x_1$ and $x_2$, defined by

\[ \begin{aligned}
 x_1^2 + x_2^2 - 1 &= 0, \\
 5x_1^2 + 21x_2^2 - 9 &= 0.
\end{aligned} \]

Here $x = (x_1, x_2)^T$ and $f = (f_1, f_2)^T$ with

\[ \begin{aligned}
 f_1(x_1, x_2) &= x_1^2 + x_2^2 - 1, \\
 f_2(x_1, x_2) &= 5x_1^2 + 21x_2^2 - 9.
\end{aligned} \]

Let us suppose that we need to find the solution of the system $f(x) = 0$
in the first quadrant of the $(x_1, x_2)$-coordinate system.


Of course, the example is a little artificial, since we already know from
Example 4.1 that $\xi_2 = (\sqrt{3}/2, 1/2)^T$ is the required solution. In what
follows, however, we proceed as if we knew nothing about the location
of $\xi_2$. Our aim here is to illustrate the construction of the function $g$
from $f$ and the verification of the hypotheses of Theorem 4.1.
Let us rewrite the two equations as

\[ x_1 = (1 - x_2^2)^{1/2}, \qquad x_2 = \frac{1}{\sqrt{21}}\,(9 - 5x_1^2)^{1/2}, \]

and define $g_1(x_1, x_2)$ and $g_2(x_1, x_2)$ as the right-hand sides of these,
respectively. We consider the simultaneous iteration

\[ x^{(k+1)} = g(x^{(k)}), \qquad k = 0, 1, 2, \dots, \tag{4.14} \]

with suitably chosen $x^{(0)}$ and $g = (g_1, g_2)^T$.

Our first task is to find a closed subset $D$ of $\mathbb{R}^2$ containing the required
solution, such that $g$ satisfies the hypotheses of Theorem 4.1 on $D$. In
order to ensure that $x \mapsto g(x)$ is real-valued and continuous, and that
the partial derivatives of $g_1$ and $g_2$ are continuous at $x = (x_1, x_2)^T \in \mathbb{R}^2$,
we demand that $|x_2| < 1$ and $|x_1| < 3/\sqrt{5}$. In fact, since we are looking
for a solution in the first quadrant, it is natural to suppose that $x_1 \ge 0$,
$x_2 \ge 0$. Hence we let $M = \{x \in \mathbb{R}^2 : 0 \le x_1 < 3/\sqrt{5},\ 0 \le x_2 < 1\}$, and
we seek $D$ as a suitable closed subset of $M$.

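For $x \in M$, let

\[ J_g(x) = \begin{pmatrix} \partial g_1/\partial x_1 & \partial g_1/\partial x_2 \\ \partial g_2/\partial x_1 & \partial g_2/\partial x_2 \end{pmatrix}. \]

Clearly,

\[ \frac{\partial g_1}{\partial x_1} = 0, \qquad \frac{\partial g_1}{\partial x_2} = -x_2 (1 - x_2^2)^{-1/2}, \qquad
 \frac{\partial g_2}{\partial x_1} = -\frac{5x_1}{\sqrt{21}} (9 - 5x_1^2)^{-1/2}, \qquad \frac{\partial g_2}{\partial x_2} = 0, \]

so we conclude that, for any $x \in M$,

\[ \|J_g(x)\|_\infty = \max\left( x_2 (1 - x_2^2)^{-1/2},\ \frac{5x_1}{\sqrt{21}} (9 - 5x_1^2)^{-1/2} \right). \]

In particular, we have $\|J_g(x)\|_\infty < 1$ provided that

\[ x_2^2 < 1 - x_2^2 \quad \text{and} \quad 25x_1^2 < 21(9 - 5x_1^2), \]

that is, when $x_2^2 < 1/2$ and $x_1^2 < 189/130$. These conditions are clearly
satisfied if, for example, $0 \le x_1 \le 1$ and $0 \le x_2 \le 3/5$. If we now define
$D = [0, 1] \times [0, 3/5]$, then, analogously as in (4.13), we have that

\[ \|g(x) - g(y)\|_\infty \le \max_{t \in [0,1]} \|J_g(tx + (1-t)y)\|_\infty \, \|x - y\|_\infty \]

for all $x$ and $y$ in $D$. Therefore, also,

\[ \|g(x) - g(y)\|_\infty \le L\|x - y\|_\infty \]

with

\[ L = \max_{z \in D} \|J_g(z)\|_\infty < 1. \tag{4.15} \]

With our choice of $D$, (4.15) holds with $L = \max\{0.75, 0.55\} = 0.75 < 1$.
Furthermore, it is easy to check that $g(D) \subset D$. Thus we deduce from
Theorem 4.1 that $g$ has a unique fixed point in $D$; we call this fixed
point $\xi_2$, for the sake of consistency with the notation in Example 4.1.
Moreover, the sequence $(x^{(k)})$ defined by (4.14) converges to $\xi_2$.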

After all these preparations you are now probably curious to see 
what the successive iterates look like: Table 4.1 gives a flavour of the 
behaviour of the sequence $(x^{(k)})$, with the starting value chosen as
$x^{(0)} = (0.5, 0.3)^T$. You can see that after 15 iterations the first 5 decimal
digits have settled to their correct values.¹
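
The iteration (4.14) takes only a few lines to program. The Python sketch below is ours and is meant as an illustration, not as the code used to produce Table 4.1; the helper names `g` and `jacobian_norm_inf` are our own. It performs 15 steps from $(0.5, 0.3)^T$ and also evaluates $\|J_g\|_\infty$ at the corner $(1, 3/5)$ of $D$, where both rows of the Jacobian matrix are largest.

```python
import math


def g(x):
    """One step of (4.14): g1 depends only on x2, and g2 only on x1."""
    x1, x2 = x
    return (math.sqrt(1.0 - x2 ** 2),
            math.sqrt(9.0 - 5.0 * x1 ** 2) / math.sqrt(21.0))


def jacobian_norm_inf(x):
    """Infinity norm of J_g(x); each row has a single nonzero entry."""
    x1, x2 = x
    row1 = abs(x2) / math.sqrt(1.0 - x2 ** 2)
    row2 = 5.0 * abs(x1) / (math.sqrt(21.0) * math.sqrt(9.0 - 5.0 * x1 ** 2))
    return max(row1, row2)


exact = (math.sqrt(3.0) / 2.0, 0.5)
x = (0.5, 0.3)
iterates = [x]
for k in range(15):
    x = g(x)
    iterates.append(x)

error_15 = max(abs(x[0] - exact[0]), abs(x[1] - exact[1]))
print(iterates[1], iterates[15], error_15)
print(jacobian_norm_inf((1.0, 0.6)))   # largest value over [0,1] x [0,3/5]
```

Both components are updated from the previous iterate, which is what distinguishes simultaneous iteration from a scheme that reuses the new $x_1$ when computing $x_2$. Readers checking the hypothesis $g(D) \subset D$ numerically may notice that the second component of the first iterate, about 0.6075, is slightly larger than 3/5, while the later iterates in the table all lie in $D$.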


4.3 Relaxation and Newton’s method 


We now go on to apply the ideas developed in the previous section to 
the construction of an iteration which converges to a solution of the 
equation $f(x) = 0$, where $f\colon \mathbb{R}^n \to \mathbb{R}^n$. One way of constructing such
a sequence is by relaxation.


Definition 4.4 The recursion

\[ x^{(k+1)} = x^{(k)} - \lambda f(x^{(k)}), \qquad k = 0, 1, 2, \dots, \tag{4.16} \]

where $x^{(0)} \in \mathbb{R}^n$ is given and where $\lambda \ne 0$ is a constant, is called
simultaneous relaxation.


Suppose that the sequence $(x^{(k)})$ converges to a limit $\xi \in \mathbb{R}^n$ and $f$ is
continuous in a neighbourhood of $\xi$; then, on passing to the limit $k \to \infty$
in (4.16), we deduce that $\xi$ is a solution of the equation $f(x) = 0$.

Simultaneous relaxation is evidently a simultaneous iteration defined
by taking $g(x) = x - \lambda f(x)$.

¹ You may wish to contemplate the following question: how many iterations should
be performed to ensure that all 15 digits have settled to their correct values? Use
inequality (4.6) to get an idea of the (maximum) amount of work involved!


Table 4.1. The first 15 iterates in the sequence $x^{(k)} = (x_1^{(k)}, x_2^{(k)})^T$
defined by (4.14), with starting value $(0.5, 0.3)^T$. The exact solution is
$\xi_2 = (\sqrt{3}/2, 1/2)^T = (0.866025403784439, 0.500000000000000)^T$ to 15
decimal digits.

  k     x_1^(k)              x_2^(k)
  0     0.500000000000000    0.300000000000000
  1     0.953939197667987    0.607492896293956
  2     0.794325110362489    0.460331145598201
  3     0.887747281827575    0.527583804908580
  4     0.849502989281489    0.490845908224662
  5     0.871246402792635    0.506703790432366
  6     0.862120217116774    0.497835722000956
  7     0.867271349636195    0.501604267098156
  8     0.865097196405654    0.499485546313646
  9     0.866322220091208    0.500382434879534
 10     0.865804492286815    0.499877559050176
 11     0.866096083560039    0.500091082450647
 12     0.865972810920378    0.499970850112656
 13     0.866042232825645    0.500021687802653
 14     0.866012881963649    0.499993059704778
 15     0.866029410728674    0.500005163847862


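Theorem 4.3 Suppose that $f(\xi) = 0$, and that all the first partial
derivatives of $f = (f_1, \dots, f_n)^T$ are defined and continuous in some
(open) neighbourhood of $\xi$, and satisfy a condition of strict diagonal
dominance at $\xi$; i.e.,

\[ \frac{\partial f_i}{\partial x_i}(\xi) > \sum_{\substack{j=1 \\ j \ne i}}^{n} \left| \frac{\partial f_i}{\partial x_j}(\xi) \right|, \qquad i = 1, 2, \dots, n. \tag{4.17} \]

Then, there exist $\varepsilon > 0$ and a positive constant $\lambda$ such that the relaxation
iteration (4.16) converges to $\xi$ for any $x^{(0)}$ in the closed ball $\bar{B}_\varepsilon(\xi)$ of
radius $\varepsilon$, centre $\xi$.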


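Proof The elements of the Jacobian matrix $J_g(\xi) = (\gamma_{ij}) \in \mathbb{R}^{n\times n}$ of the
function $x \mapsto g(x) = x - \lambda f(x)$ at $x = \xi$ are

\[ \gamma_{ii} = 1 - \lambda \frac{\partial f_i}{\partial x_i}(\xi), \qquad
 \gamma_{ij} = -\lambda \frac{\partial f_i}{\partial x_j}(\xi), \quad i \ne j, \quad i, j \in \{1, \dots, n\}. \]

We now define

\[ m = \max_{i=1}^{n} \frac{\partial f_i}{\partial x_i}(\xi) \]

and then choose $\lambda = 1/m$. Under hypothesis (4.17), $m > 0$ and therefore
$\lambda > 0$. This choice of $\lambda$ ensures that all the diagonal elements $\gamma_{ii}$,
$i = 1, \dots, n$, of $J_g(\xi)$ are nonnegative. Moreover, for any $i \in \{1, \dots, n\}$,

\[ \sum_{j=1}^{n} |\gamma_{ij}| = 1 - \lambda \left( \frac{\partial f_i}{\partial x_i}(\xi) - \sum_{\substack{j=1 \\ j \ne i}}^{n} \left| \frac{\partial f_i}{\partial x_j}(\xi) \right| \right) < 1 \]

by condition (4.17); consequently, $\|J_g(\xi)\|_\infty < 1$. As $\xi$ is a fixed point
of $g$, it follows from Theorem 4.2 that there exists $\varepsilon > 0$ such that the
iteration (4.16) converges to $\xi$ for all $x^{(0)} \in \bar{B}_\varepsilon(\xi)$.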
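
The choice $\lambda = 1/m$ made in the proof can be tried on a small example. The system in the Python sketch below is hypothetical and was chosen by us so that its solution is $\xi = (1, 1)^T$ and so that (4.17) holds there: the diagonal partial derivatives at $\xi$ are 4 and 5, and the off-diagonal ones are both 2. The starting value $(0.8, 1.2)^T$ is likewise our own choice.

```python
def f(x):
    """Hypothetical test system with solution (1, 1)."""
    x1, x2 = x
    return (4.0 * x1 + x2 ** 2 - 5.0,
            x1 ** 2 + 5.0 * x2 - 6.0)


# Diagonal partial derivatives at the solution: df1/dx1 = 4, df2/dx2 = 5.
m = max(4.0, 5.0)
lam = 1.0 / m            # the relaxation parameter chosen in the proof

x = (0.8, 1.2)           # starting value close to the solution
for k in range(200):
    fx = f(x)
    x = (x[0] - lam * fx[0], x[1] - lam * fx[1])   # one step of (4.16)

error = max(abs(x[0] - 1.0), abs(x[1] - 1.0))
print(x, error)
```

In practice the partial derivatives at the unknown solution are not available, so $m$ would have to be estimated, for instance from the derivatives at the starting value; the sketch uses the exact values only because the solution of this artificial system is known.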


The condition of strict diagonal dominance will only be satisfied in a
small class of problems (although this class does contain some examples
of practical importance). More generally it will be necessary to replace
the scalar $\lambda$ by a nonsingular constant matrix $\Lambda$, giving a more general
relaxation iteration

\[ x^{(k+1)} = x^{(k)} - \Lambda f(x^{(k)}), \qquad k = 0, 1, 2, \dots. \]

This may be interpreted as trying to solve the new system of equations
$\Lambda f(x) = 0$. The Jacobian matrix of this system is $\Lambda J_f$, where $J_f$ is
the Jacobian matrix of $f$. It is now possible to select the matrix $\Lambda$ so
that $\Lambda J_f(\xi)$ has the property of strict diagonal dominance. In principle,
this can obviously be done by choosing $\Lambda = [J_f(\xi)]^{-1}$, the inverse of the
Jacobian matrix of $f$ evaluated at the solution $\xi$. The Jacobian matrix
of the new system is then the identity matrix, which clearly satisfies
the diagonal dominance condition. However, this choice is not possible
in practice, since of course the solution $\xi$ is unknown. If we allow the
matrix $\Lambda$ to be a function of $x$, instead of being constant, the argument
above suggests taking

\[ \Lambda = [J_f(x)]^{-1}, \]

leading to Newton’s method for a system of equations.


Definition 4.5 The recursion defined by

\[ x^{(k+1)} = x^{(k)} - [J_f(x^{(k)})]^{-1} f(x^{(k)}), \qquad k = 0, 1, 2, \dots, \tag{4.18} \]

where $x^{(0)} \in \mathbb{R}^n$, is called Newton’s method (or Newton iteration) for
the system of equations $f(x) = 0$. It is implicitly assumed that the
matrix $J_f(x^{(k)})$ exists and is nonsingular for each $k = 0, 1, 2, \dots$.

The next theorem is concerned with the convergence of Newton’s 
method. As in the scalar case, for a starting value $x^{(0)}$ that is sufficiently
close to the solution $\xi$ of $f(x) = 0$, Newton’s method converges
quadratically. The precise definition of quadratic convergence is given 
below: it resembles Definition 1.7 of Chapter 1. 


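Definition 4.6 Suppose that $(x^{(k)})$ is a convergent sequence in $\mathbb{R}^n$ and
$\xi = \lim_{k\to\infty} x^{(k)}$. We say that $(x^{(k)})$ converges to $\xi$ with at least
order $q > 1$, if there exist a sequence $(\varepsilon_k)$ of positive real numbers
converging to 0, and $\mu > 0$, such that

\[ \|x^{(k)} - \xi\|_\infty \le \varepsilon_k, \quad k = 0, 1, 2, \dots, \qquad \text{and} \qquad \lim_{k\to\infty} \frac{\varepsilon_{k+1}}{\varepsilon_k^{\,q}} = \mu. \tag{4.19} \]

If (4.19) holds with $\varepsilon_k = \|x^{(k)} - \xi\|_\infty$, $k = 0, 1, 2, \dots$, then the sequence
$(x^{(k)})$ is said to converge to $\xi$ with order $q$. In particular, if $q = 2$,
then we say that the sequence $(x^{(k)})$ converges to $\xi$ quadratically.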


Again, due to (4.4), if a sequence $(x^{(k)})$ converges quadratically in the
$\infty$-norm, then it also does so in the $p$-norm for any $p \in [1, \infty)$, though
the constant $\mu$ may be different.


Theorem 4.4 Suppose that $f(\xi) = 0$, that in some (open) neighbourhood
$N(\xi)$ of $\xi$, where $f$ is defined and continuous, all the second-order
partial derivatives of $f$ are defined and continuous, and that the Jacobian
matrix $J_f(\xi)$ of $f$ at the point $\xi$ is nonsingular. Then, the sequence
$(x^{(k)})$ defined by Newton’s method (4.18) converges to the solution $\xi$ provided
that $x^{(0)}$ is sufficiently close to $\xi$; the convergence of the sequence
$(x^{(k)})$ to $\xi$ is at least quadratic.


Proof Let us begin by writing Newton’s method as a simultaneous iteration
$x^{(k+1)} = g(x^{(k)})$, $k = 0, 1, 2, \dots$, as in (4.3), with $x^{(0)}$ given and

\[ g(x) = x - [J_f(x)]^{-1} f(x). \]

The idea of the proof is to verify that the function $g$ satisfies all the
conditions of Theorem 4.2 in a certain closed ball centred at $\xi$, the fixed
point of $g$, and thus deduce that the sequence $(x^{(k)})$ converges to $\xi$.
As the function $x \mapsto \det J_f(x)$ is continuous in $N(\xi)$ and $\det J_f(\xi) \ne 0$,
there exists $\varepsilon > 0$ such that $\det J_f(x) \ne 0$ for all $x \in \bar{B}_\varepsilon(\xi) \subset N(\xi)$.
Further, as the entries of $[J_f(x)]^{-1}$ depend continuously on the entries
of $J_f(x)$ and since the entries of $J_f(\cdot)$ are continuous functions of $x$ in
$N(\xi)$, we deduce that $x \mapsto [J_f(x)]^{-1} f(x)$ is a continuous function on
$\bar{B}_\varepsilon(\xi)$; therefore,

\[ x \mapsto g(x) = x - [J_f(x)]^{-1} f(x) \]

is also a continuous function on $\bar{B}_\varepsilon(\xi)$. For later reference, we note that
$x \mapsto \|[J_f(x)]^{-1}\|_\infty$, too, is a continuous function on $\bar{B}_\varepsilon(\xi)$, and therefore
it is a bounded function on $\bar{B}_\varepsilon(\xi)$; we define

\[ C = \max_{x \in \bar{B}_\varepsilon(\xi)} \|[J_f(x)]^{-1}\|_\infty . \]

Now, $\xi$ is a fixed point of $g$ and, by the hypotheses of the theorem,
the entries of the Jacobian matrix $J_g$ of $g$ are continuous functions of
$x$ on $\bar{B}_\varepsilon(\xi)$. Furthermore, it is easy to check that all the elements of
the Jacobian matrix $J_g(x)$ of $g$ vanish at $x = \xi$; see Exercise 6. Hence,
$\|J_g(\xi)\|_\infty = 0 < 1$, trivially. Thus we have shown that $g\colon \mathbb{R}^n \to \mathbb{R}^n$
satisfies all the conditions of Theorem 4.2 on the closed set $D = \bar{B}_\varepsilon(\xi)$,
and the convergence of the sequence $(x^{(k)})$ to $\xi$, as $k \to \infty$, follows.

To show that convergence is at least quadratic, we write the iteration 
in the form 


Jp (a) [wD — € = Jy(w@™) [x — €] — f(a). (4.20) 


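Taylor’s Theorem for a function of $n$ variables, Theorem A.7 (including only the first-order terms), implies that, when $\mathbf{x}^{(k)} \in \bar{B}_\varepsilon(\boldsymbol{\xi})$,

$$\mathbf{0} = \mathbf{f}(\boldsymbol{\xi}) = \mathbf{f}(\mathbf{x}^{(k)}) + J_f(\mathbf{x}^{(k)})\,[\boldsymbol{\xi} - \mathbf{x}^{(k)}] + \mathbf{E}_k, \tag{4.21}$$

where

$$\|\mathbf{E}_k\|_\infty \le \tfrac{1}{2} n^2 A_f \|\boldsymbol{\xi} - \mathbf{x}^{(k)}\|_\infty^2, \tag{4.22}$$

and

$$A_f = \max_{1 \le i,j,l \le n}\; \max_{\mathbf{x} \in \bar{B}_\varepsilon(\boldsymbol{\xi})} \left| \frac{\partial^2 f_i}{\partial x_j\, \partial x_l}(\mathbf{x}) \right|$$

is a bound on all the second-order partial derivatives of $\mathbf{f}$ on $\bar{B}_\varepsilon(\boldsymbol{\xi})$. The factor $n^2$ in (4.22) stems from the fact that, for each $i \in \{1,\dots,n\}$, $f_i$ is a function of $n$ variables and therefore it has $n^2$ second-order partial derivatives, each bounded by $A_f$ over $\bar{B}_\varepsilon(\boldsymbol{\xi})$. From (4.21) and (4.20) we see that

$$\mathbf{x}^{(k+1)} - \boldsymbol{\xi} = [J_f(\mathbf{x}^{(k)})]^{-1}\,\mathbf{E}_k,$$

and so

$$\|\mathbf{x}^{(k+1)} - \boldsymbol{\xi}\|_\infty \le \tfrac{1}{2} n^2 A_f C\, \|\mathbf{x}^{(k)} - \boldsymbol{\xi}\|_\infty^2.$$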


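On writing $M = \tfrac{1}{2} n^2 A_f C$, we then deduce by induction that

$$\|\mathbf{x}^{(k)} - \boldsymbol{\xi}\|_\infty \le \frac{1}{M}\left(M \|\mathbf{x}^{(0)} - \boldsymbol{\xi}\|_\infty\right)^{2^k}, \qquad k = 0,1,2,\dots.$$

Suppose that $\mathbf{x}^{(0)} \in \bar{B}_\varepsilon(\boldsymbol{\xi})$ where $\varepsilon \le \tfrac{1}{2}\min\{1, 1/M\}$. Then,

$$M \|\mathbf{x}^{(k)} - \boldsymbol{\xi}\|_\infty \le \tfrac{1}{2}, \qquad k = 0,1,2,\dots,$$

and hence

$$\|\mathbf{x}^{(k)} - \boldsymbol{\xi}\|_\infty \le \frac{1}{M}\left(\frac{1}{2}\right)^{2^k}.$$

This implies that convergence is at least quadratic (on choosing $\varepsilon_k = M^{-1} 2^{-2^k}$ and $q = 2$ in Definition 4.6).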


Newton’s method is defined in (4.18) by using the inverse of the Jaco- 
bian matrix. As we saw in Chapter 2 it is more efficient to avoid inverting 
a matrix, if possible. In practice the method is therefore implemented 
by writing (4.18) in the form 


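$$J_f(\mathbf{x}^{(k)})\,(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) = -\mathbf{f}(\mathbf{x}^{(k)}). \tag{4.23}$$

Given the vector $\mathbf{x}^{(k)}$, we calculate $\mathbf{f}(\mathbf{x}^{(k)})$ and the Jacobian matrix $J_f(\mathbf{x}^{(k)}) \in \mathbb{R}^{n \times n}$, and then solve the system of linear equations (4.23) by Gaussian elimination; this gives the increment vector $\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}$ which is added to $\mathbf{x}^{(k)}$ to obtain the new iterate $\mathbf{x}^{(k+1)}$.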


Example 4.5 We close this section with an example which illustrates the 
application of Newton’s method. Consider the simultaneous nonlinear 
equations 


$$\begin{aligned} f_1(x,y,z) &\equiv x^2 + y^2 + z^2 - 1 = 0, \\ f_2(x,y,z) &\equiv 2x^2 + y^2 - 4z = 0, \\ f_3(x,y,z) &\equiv 3x^2 - 4y + z^2 = 0. \end{aligned}$$

Letting $\mathbf{f} = (f_1, f_2, f_3)^{\mathrm T}$ and $\mathbf{x} = (x,y,z)^{\mathrm T}$, the aim of the exercise is to determine the solution to the equation $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ contained in the first octant $\{(x,y,z) \in \mathbb{R}^3 : x > 0,\ y > 0,\ z > 0\}$ in $\mathbb{R}^3$.

Fig. 4.3. Example 4.5: Projections onto the $(y,z)$-plane of the intersection-curves of the surfaces $f_1(x,y,z) = 0$ and $f_2(x,y,z) = 0$, and $f_1(x,y,z) = 0$ and $f_3(x,y,z) = 0$. The two curves intersect at the point $P$ whose two coordinates are the $y$- and $z$-coordinates of $\boldsymbol{\xi}$, the solution of the system $f_1(x,y,z) = 0$, $f_2(x,y,z) = 0$, $f_3(x,y,z) = 0$.


Note that the Jacobian matrix of $\mathbf{f}$ at $\mathbf{x} \in \mathbb{R}^3$ is

$$J_f(\mathbf{x}) = \begin{pmatrix} 2x & 2y & 2z \\ 4x & 2y & -4 \\ 6x & -4 & 2z \end{pmatrix}.$$


Since the first equation represents a sphere of radius 1 centred at $(0,0,0)$, and the second and third equations describe elliptic paraboloids whose axes are aligned with the coordinate semi-axes $(0,0,z)$, $z > 0$, and $(0,y,0)$, $y > 0$, respectively, the point of intersection of the three surfaces belongs to $[0,1]^3$. Let us denote this point by $\boldsymbol{\xi}$. In order to select a suitable starting value $\mathbf{x}^{(0)}$ for the iteration, we observe that the intersection of the first and the second surface is a curve whose projection onto the $(y,z)$-plane has the equation $y^2 + 2z^2 + 4z = 2$, while the intersection of the first and the third surface is a curve whose projection onto the $(y,z)$-plane has the equation $3y^2 + 4y + 2z^2 = 3$. The two curves are shown in Figure 4.3; the point $P$ where the curves intersect has the same $y$- and $z$-coordinates as $\boldsymbol{\xi}$. The $x$-coordinate of $\boldsymbol{\xi}$ can be obtained from the first equation in terms of the $y$- and $z$-coordinates of $P$ via $x = +(1 - y^2 - z^2)^{1/2}$. As the two coordinates of $P$ are, very roughly, $y \approx 0.5$ and $z \approx 0.5$, it is reasonable to choose as starting value for the Newton iteration the point $\mathbf{x}^{(0)} = (0.5, 0.5, 0.5)^{\mathrm T}$. Thus, $\mathbf{f}(\mathbf{x}^{(0)}) = (-0.25, -1.25, -1.00)^{\mathrm T}$ and

$$J_f(\mathbf{x}^{(0)}) = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 1 & -4 \\ 3 & -4 & 1 \end{pmatrix}.$$


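On solving the system of linear equations

$$J_f(\mathbf{x}^{(0)})\,(\mathbf{x}^{(1)} - \mathbf{x}^{(0)}) = -\mathbf{f}(\mathbf{x}^{(0)})$$

for $\mathbf{x}^{(1)} - \mathbf{x}^{(0)}$, we find that $\mathbf{x}^{(1)} = (0.875, 0.500, 0.375)^{\mathrm T}$. Similarly,

$$\mathbf{x}^{(2)} = (0.78981, 0.49662, 0.36993)^{\mathrm T}, \qquad \mathbf{x}^{(3)} = (0.78521, 0.49662, 0.36992)^{\mathrm T}.$$

As $\mathbf{f}(\mathbf{x}^{(3)}) = 10^{-5}(1, 4, 5)^{\mathrm T}$, the vector $\mathbf{x}^{(3)}$ can be thought of as a satisfactory approximation to the required solution $\boldsymbol{\xi}$; after rounding to four decimal digits, we have that

$$x = 0.7852, \qquad y = 0.4966, \qquad z = 0.3699.$$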
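The computation in Example 4.5 can be reproduced with a short program. The following is a minimal sketch, not the authors' implementation: the function names `solve_linear`, `f`, `jacobian` and `newton`, the stopping tolerance and the iteration cap are all assumptions made for illustration. Each Newton step solves the linear system (4.23) by Gaussian elimination with partial pivoting instead of forming the inverse Jacobian.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ZeroDivisionError("singular Jacobian matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def f(v):
    """The three functions f1, f2, f3 of Example 4.5."""
    x, y, z = v
    return [x * x + y * y + z * z - 1.0,
            2.0 * x * x + y * y - 4.0 * z,
            3.0 * x * x - 4.0 * y + z * z]


def jacobian(v):
    """Jacobian matrix of f at v."""
    x, y, z = v
    return [[2.0 * x, 2.0 * y, 2.0 * z],
            [4.0 * x, 2.0 * y, -4.0],
            [6.0 * x, -4.0, 2.0 * z]]


def newton(f, jacobian, x0, tol=1e-12, max_iter=50):
    """Return the list of Newton iterates starting from x0."""
    iterates = [list(x0)]
    for _ in range(max_iter):
        x = iterates[-1]
        # Solve J_f(x) d = -f(x) for the increment d, as in (4.23).
        d = solve_linear(jacobian(x), [-fi for fi in f(x)])
        iterates.append([xi + di for xi, di in zip(x, d)])
        if max(abs(di) for di in d) < tol:
            break
    return iterates


iterates = newton(f, jacobian, [0.5, 0.5, 0.5])
for k, x in enumerate(iterates[:4]):
    print(k, [round(xi, 5) for xi in x])
```

The first few printed iterates can be compared with $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$ and $\mathbf{x}^{(3)}$ quoted above; the number of correct digits roughly doubles from one step to the next, as the quadratic convergence estimate suggests.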


4.4 Global convergence 


Much of the discussion of the global convergence of Newton’s method for 
a single equation in Section 1.7 applies, with obvious changes, in the case 
of several variables. If the system has several solutions, $\boldsymbol{\xi}_1, \boldsymbol{\xi}_2, \dots$, we can define the corresponding sets $S_1, S_2, \dots$ in $\mathbb{R}^n$ so that $S_j$ comprises those starting points from which Newton’s method converges to $\boldsymbol{\xi}_j$. As before, the sets $S_j$, $j = 1,2,\dots$, have the property that any point on the boundary of one of the sets is also on the boundary of the others. The difference now is that for systems of equations in $\mathbb{R}^n$, $n \ge 2$, these sets can be much more complicated than in the case of a single equation on the real line $\mathbb{R}^1 = \mathbb{R}$.

To illustrate this point for $n = 2$, we return to our earlier example problem, Example 1.7 from Chapter 1, but now extend it to complex variables, so we require to solve $e^z - z - 2 = 0$ for the complex number $z = x + \imath y$. Separating this equation into real and imaginary parts we obtain a system of two nonlinear equations for the unknowns $x_1 = x$ and $x_2 = y$. The system has the two real solutions which we found in Chapter 1, and also an infinite number of complex solutions. It is easy to see from the periodic character of $e^{\imath y}$ that the equation has a solution near $(2m + \tfrac{1}{2})\imath\pi$, $\imath = \sqrt{-1}$, for integer values of $m$; a better estimate is given in Exercise 9. It is a good deal more difficult to prove that there are no other solutions.

The behaviour of Newton’s method for this problem may be illustrated 
by showing a picture of the complex plane, with the sets S; depicted in 
different colours. In our example we cannot, of course, show more than 
a small number of the solutions, and cannot use an infinite number of 
colours. We have therefore coloured the sets with six colours cyclically, so that, for example, the sets $S_1, S_7, S_{13}, \dots$ have the same colour. The background colour, white, represents the set of points from which
the iteration converges to the real negative root. It includes most of 
the negative half-plane. Successive pictures in the series from Figure 4.5 
to Figure 4.9 show a magnified view of a small region of the previous 
picture, the region being outlined in black. In Figure 4.4 the black 
crosses mark the positions of solutions of f(z) = 0. The pictures show 
in a striking way the fractal behaviour of the boundary of a set. Figure 
4.9 is very similar to Figure 4.5; the former is a magnified view of a 
small part of Figure 4.5, with a magnification of about 50000 in each 
direction. The same sort of behaviour is repeated when the picture is 
magnified indefinitely. 


4.5 Notes 


For an introduction to the topology of $\mathbb{R}^n$, including the definitions of
open set, closed set, continuity, convergence and Cauchy sequence, the 
reader is referred to any standard textbook on the subject; see, e.g., 


» W. RUDIN, Principles of Mathematical Analysis, Third Edition, International Series in Pure and Applied Mathematics, McGraw-Hill, New York, Auckland, Düsseldorf, 1976,

» S.A. DOUGLASS, Introduction to Mathematical Analysis, Addison-Wesley, Reading, MA, 1996.


Our first remark concerns the Contraction Mapping Theorem, Theo- 
rem 4.1, which is a direct generalisation of Theorem 1.3 from Chapter 
1. Comparing the proofs of Theorems 1.3 and 4.1, we see that the proof 
of Theorem 1.3 is much simpler. This is not accidental: in the case of a single equation $x = g(x)$, involving a real-valued function $g$ of a single real variable $x$, the existence of a fixed point follows directly from Theorem 1.2, Brouwer’s Fixed Point Theorem on a bounded closed interval of the real line. On the other hand, for the simultaneous system of equations $\mathbf{x} = \mathbf{g}(\mathbf{x})$ in $\mathbb{R}^n$ considered in Theorem 4.1 we had to invoke the completeness of $\mathbb{R}^n$ (i.e., the property that every Cauchy sequence in $\mathbb{R}^n$ is a convergent sequence) to show the existence of a fixed point. An alternative, shorter proof of Theorem 4.1 could have been devised by applying Brouwer’s Fixed Point Theorem in $\mathbb{R}^n$.


Theorem 4.5 (Brouwer’s Fixed Point Theorem) Let us assume that $D$ is a nonempty, closed, bounded and convex subset of $\mathbb{R}^n$. Suppose further that $\mathbf{g}: \mathbb{R}^n \to \mathbb{R}^n$ is a continuous function defined on $D$ such that $\mathbf{g}(D) \subset D$. Then, there exists $\boldsymbol{\xi} \in D$ such that $\mathbf{g}(\boldsymbol{\xi}) = \boldsymbol{\xi}$.


A set $D \subset \mathbb{R}^n$ is said to be convex if, whenever $\mathbf{x}$ and $\mathbf{y}$ belong to $D$, also

$$\theta \mathbf{x} + (1 - \theta)\mathbf{y} \in D \qquad \forall\, \theta \in [0,1].$$

For example, any nonempty interval of the real line $\mathbb{R}^1 = \mathbb{R}$ is a convex set, as is a nonempty (open or closed) ball in $\mathbb{R}^n$, $n \ge 2$. Unfortunately, when $n \ge 2$ the proof of Theorem 4.5 is nontrivial and is well beyond the scope of this book.¹

Benoit Mandelbrot (1924— ) has been largely responsible for the present 
interest in fractal geometry and its connections with iterative methods. 
Mandelbrot highlighted in his book 


» B. MANDELBROT, Fractals: Form, Chance, and Dimension, W.H. 
Freeman, San Francisco, 1977, 


and, more fully, in 


> B. MANDELBROT, The Fractal Geometry of Nature, W.H. Freeman, 
New York, 1983, 


the omnipresence of fractals both in mathematics and elsewhere in nature. In relation with the subject of this chapter, we note that the Mandelbrot set is a connected set of points in the complex plane defined as follows. Choose a point $z_0$ in the complex plane, and consider the iteration $z_{n+1} = z_n^2 + z_0$, $n = 0,1,2,\dots$. If the sequence $z_0, z_1, z_2, \dots$ remains within a distance of 2 from the origin for ever, then the point $z_0$ is said to be in the Mandelbrot set. If the sequence diverges from the origin, then the point $z_0$ is not in the set.

¹ For a proof of Theorem 4.5 in the case when $D$ is a closed ball in $\mathbb{R}^n$, see John W. Milnor, Topology from the Differentiable Viewpoint, Princeton Landmarks in Mathematics, 1997.
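The membership rule just described can be tried out numerically. The sketch below is only an approximation of the definition: a program cannot iterate "for ever", so the hypothetical function `in_mandelbrot` stops after `max_iter` steps (an assumed cut-off) and reports the point as belonging to the set if the orbit has stayed within distance 2 of the origin up to then.

```python
def in_mandelbrot(z0, max_iter=200):
    """Approximate test: iterate z_{n+1} = z_n**2 + z0 starting from z0."""
    z = z0
    for _ in range(max_iter):
        if abs(z) > 2.0:
            return False          # the orbit has left the disc of radius 2
        z = z * z + z0
    return abs(z) <= 2.0


for point in (0j, -1 + 0j, 1j, 1 + 0j, 0.5 + 0j):
    print(point, in_mandelbrot(point))
```

Points close to the boundary of the set may need a much larger `max_iter` before their orbit escapes, so the answer near the boundary depends on the cut-off.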

A standard reference for theoretical results concerning the convergence 
of Newton’s method in complete normed linear spaces is 


» L.V. KANTOROVICH AND G.P. AKILOV, Functional Analysis, Second 
edition, Pergamon Press, Oxford, New York, 1982. 


A further significant book in the area of iterative solution of systems of 
nonlinear equations is the text by 


» J.M. ORTEGA AND W.C. RHEINBOLDT, Iterative Solution of Non- 
linear Equations in Several Variables, Reprint of the 1970 original, 
Classics in Applied Mathematics, 30, SIAM, Philadelphia, 2000. 


It gives a comprehensive treatment of the numerical solution of n non- 
linear equations in n unknowns, covering asymptotic convergence results 
for a number of algorithms, including Newton’s method, as well as exis- 
tence theorems for solutions of nonlinear equations based on the use of 
topological degree theory and Brouwer’s Fixed Point Theorem. 


Exercises 


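4.1 Suppose that the function $\mathbf{g}$ is a contraction in the $\infty$-norm, as in (4.5). Use the fact that

$$\|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\|_p \le n^{1/p}\, \|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\|_\infty$$

to show that $\mathbf{g}$ is a contraction in the $p$-norm if $L < n^{-1/p}$.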


4.2 Show that the simultaneous equations $\mathbf{f}(x_1, x_2) = \mathbf{0}$, where $\mathbf{f} = (f_1, f_2)^{\mathrm T}$, with

$$f_1(x_1, x_2) = x_1^2 + x_2^2 - 25, \qquad f_2(x_1, x_2) = x_1 - 7x_2 - 25,$$

have two solutions, one of which is $x_1 = 4$, $x_2 = -3$, and find the other. Show that the function $\mathbf{f}$ does not satisfy the conditions of Theorem 4.3 at either of these solutions, but that if the sign of $f_2$ is changed the conditions are satisfied at one solution, and that if $\mathbf{f}$ is replaced by $\mathbf{f}^* = (f_2 - f_1, -f_2)^{\mathrm T}$, then the conditions are satisfied at the other. In each case, give a value of the relaxation parameter $\lambda$ which will lead to convergence.


4.3 The complex-valued function $z \mapsto g(z)$ of the complex variable $z$ is holomorphic in a convex region $\Omega$ containing the point $\zeta$, at which $g(\zeta) = \zeta$. By applying the Mean Value Theorem (Theorem A.3) to the function $\varphi$ of the real variable $t$ defined by $\varphi(t) = g((1 - t)u + tv)$ show that if $u$ and $v$ lie in $\Omega$, then there is a complex number $\eta$ in $\Omega$ such that

$$g(u) - g(v) = (u - v)\, g'(\eta).$$

Hence show that if $|g'(\zeta)| < 1$, then the complex iteration defined by $z_{k+1} = g(z_k)$, $k = 0,1,2,\dots$, converges to $\zeta$ provided that $z_0$ is sufficiently close to $\zeta$.

4.4 Suppose that in Exercise 3 the real and imaginary parts of $g$ are $u$ and $v$, so that $g(x + \imath y) = u(x,y) + \imath v(x,y)$, $\imath = \sqrt{-1}$. Show that the iteration defined by $\mathbf{x}^{(k+1)} = \mathbf{g}^*(\mathbf{x}^{(k)})$, $k = 0,1,2,\dots$, where $\mathbf{g}^*(\mathbf{x}) = (u(x_1,x_2), v(x_1,x_2))^{\mathrm T}$, generates the real and imaginary parts of the sequence defined in Exercise 3. Compare the condition for convergence given in that exercise with the sufficient condition given by Theorem 4.2.

4.5 Verify that the iteration $\mathbf{x}^{(k+1)} = \mathbf{g}(\mathbf{x}^{(k)})$, $k = 0,1,2,\dots$, where $\mathbf{g} = (g_1, g_2)^{\mathrm T}$ and $g_1$ and $g_2$ are functions of two variables defined by

$$g_1(x_1, x_2) = \tfrac{1}{3}(x_1^2 - x_2^2 + 3), \qquad g_2(x_1, x_2) = \tfrac{1}{3}(2x_1 x_2 + 1),$$

has the fixed point $\mathbf{x} = (1,1)^{\mathrm T}$. Show that the function $\mathbf{g}$ does not satisfy the conditions of Theorem 4.3. By applying the results of Exercises 3 and 4 to the complex function $g$ defined by

$$g(z) = \tfrac{1}{3}(z^2 + 3 + \imath), \qquad z \in \mathbb{C}, \quad \imath = \sqrt{-1},$$

show that the iteration, nevertheless, converges.

4.6 Suppose that all the second-order partial derivatives of the function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^n$ are defined and continuous in a neighbourhood of the point $\boldsymbol{\xi}$ in $\mathbb{R}^n$, at which $\mathbf{f}(\boldsymbol{\xi}) = \mathbf{0}$. Assume also that the Jacobian matrix, $J_f(\mathbf{x})$, of $\mathbf{f}$ is nonsingular at $\mathbf{x} = \boldsymbol{\xi}$, and denote its inverse by $K(\mathbf{x})$ at all $\mathbf{x}$ for which it exists. Defining the Newton iteration by $\mathbf{x}^{(k+1)} = \mathbf{g}(\mathbf{x}^{(k)})$, $k = 0,1,2,\dots$, with $\mathbf{x}^{(0)}$ given, where $\mathbf{g}(\mathbf{x}) = \mathbf{x} - K(\mathbf{x})\mathbf{f}(\mathbf{x})$, show that the $(i,j)$-entry of the Jacobian matrix $J_g(\mathbf{x}) \in \mathbb{R}^{n \times n}$ of $\mathbf{g}$ is

$$\delta_{ij} - \sum_{r=1}^{n} \frac{\partial K_{ir}}{\partial x_j} f_r - \sum_{r=1}^{n} K_{ir} J_{rj}, \qquad i,j = 1,\dots,n,$$

where $J_{rj}$ is the $(r,j)$-entry of $J_f(\mathbf{x})$. Deduce that all the elements of this matrix vanish at the point $\boldsymbol{\xi}$.

4.7 The vector function $\mathbf{x} \mapsto \mathbf{f}(\mathbf{x})$ of two variables is defined by

$$f_1(x_1, x_2) = x_1^2 + x_2^2 - 2, \qquad f_2(x_1, x_2) = x_1 - x_2.$$

Verify that the equation $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ has two solutions, $x_1 = x_2 = 1$ and $x_1 = x_2 = -1$. Show that one iteration of Newton’s method for the solution of this system gives $\mathbf{x}^{(1)} = (x_1^{(1)}, x_2^{(1)})^{\mathrm T}$, with

$$x_1^{(1)} = x_2^{(1)} = \frac{(x_1^{(0)})^2 + (x_2^{(0)})^2 + 2}{2\,(x_1^{(0)} + x_2^{(0)})}.$$

Deduce that the iteration converges to $(1,1)^{\mathrm T}$ if $x_1^{(0)} + x_2^{(0)}$ is positive, and, if $x_1^{(0)} + x_2^{(0)}$ is negative, the iteration converges to the other solution. Verify that convergence is quadratic.

4.8 Suppose that $\boldsymbol{\xi} = \lim_{k \to \infty} \mathbf{x}^{(k)}$ in $\mathbb{R}^n$. Following Definition 1.4, explain what is meant by saying that the sequence $(\mathbf{x}^{(k)})$ converges to $\boldsymbol{\xi}$ linearly, with asymptotic rate $-\log_{10} \mu$, where $0 < \mu < 1$.

Given the vector function $\mathbf{x} \mapsto \mathbf{f}(\mathbf{x})$ of two real variables $x_1$ and $x_2$ defined by

$$f_1(x_1, x_2) = x_1^2 + x_2^2 - 2, \qquad f_2(x_1, x_2) = x_1 + x_2 - 2,$$

show that $\mathbf{f}(\boldsymbol{\xi}) = \mathbf{0}$ when $\boldsymbol{\xi} = (1,1)^{\mathrm T}$. Suppose that $x_1^{(0)} \neq x_2^{(0)}$; show that one iteration of Newton’s method for the solution of $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ with starting value $\mathbf{x}^{(0)} = (x_1^{(0)}, x_2^{(0)})^{\mathrm T}$ then gives $\mathbf{x}^{(1)} = (x_1^{(1)}, x_2^{(1)})^{\mathrm T}$ such that $x_1^{(1)} + x_2^{(1)} = 2$. Determine $\mathbf{x}^{(1)}$ when

$$x_1^{(0)} = 1 + \alpha, \qquad x_2^{(0)} = 1 - \alpha,$$

where $\alpha \neq 0$. Assuming that $x_1^{(0)} \neq x_2^{(0)}$, deduce that Newton’s method converges linearly to $(1,1)^{\mathrm T}$, with asymptotic rate of convergence $\log_{10} 2$. Why is the convergence not quadratic?

4.9 Suppose that the equation $e^z = z + 2$, $z \in \mathbb{C}$, has a solution

$$z = (2m + \tfrac{1}{2})\imath\pi + \ln[(2m + \tfrac{1}{2})\pi] + \eta,$$

where $m$ is a positive integer and $\imath = \sqrt{-1}$. Show that

$$\eta = \ln\left[1 - \imath\,\frac{\ln[(2m + \tfrac{1}{2})\pi] + \eta + 2}{(2m + \tfrac{1}{2})\pi}\right]$$

and deduce that $\eta = O(\ln m / m)$ for large $m$.
(Note that $|\ln(1 + \imath t)| < |t|$ for all $t \in \mathbb{R} \setminus \{0\}$.)
Fig. 4.4. The sets $S_j$ in the region $-5 < x < 15$, $-4 < y < 24$ of the complex plane.

Fig. 4.5. The sets $S_j$ in the region $2 < x < 3$, $1.6 < y < 2.6$ of the complex plane.

Fig. 4.6. The sets $S_j$ in the region $2.4 < x < 2.55$, $2.1 < y < 2.25$ of the complex plane.

Fig. 4.7. The sets $S_j$ in the region $2.4825 < x < 2.4975$, $2.2075 < y < 2.2225$ of the complex plane.

Fig. 4.8. The sets $S_j$ in the region $2.4930 < x < 2.4960$, $2.2100 < y < 2.2130$ of the complex plane.

Fig. 4.9. The sets $S_j$ in the region $2.493645 < x < 2.493665$, $2.21073 < y < 2.21075$ of the complex plane.
Eigenvalues and eigenvectors of a symmetric matrix


5.1 Introduction 


Eigenvalue problems for symmetric matrices arise in all areas of ap- 
plied science. The terminology eigenvalue comes from the German word 
Eigenwert which means proper or characteristic value. The concept of 
eigenvalue first appeared in an article on systems of linear differential 
equations by the French mathematician d’Alembert¹ in the course of
studying the motion of a string with masses attached to it at various 
points. 

Let us recall from Chapter 2 the definition of eigenvalue and eigen- 
vector. 


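Definition 5.1 Suppose that $A \in \mathbb{R}^{n \times n}$. A complex number $\lambda$ for which the set of linear equations

$$A\mathbf{x} = \lambda \mathbf{x} \tag{5.1}$$

has a nontrivial solution $\mathbf{x} \in \mathbb{C}^n_* = \mathbb{C}^n \setminus \{\mathbf{0}\}$ is called an eigenvalue of $A$; the associated solution $\mathbf{x} \in \mathbb{C}^n_*$ is called an eigenvector of $A$ (corresponding to $\lambda$).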


1 Jean le Rond d’Alembert (17 November 1717, Paris, France — 29 October 1783, 
Paris, France) was abandoned as a newly born child on the steps of the church of 
St Jean le Rond in Paris and spent his early life in a home for homeless children. 
d’Alembert was the central mathematical figure among the French Encyclopedists 
in the period 1751-1772; the Encyclopedia, edited by Denis Diderot, comprised 28
volumes. D’Alembert made a number of significant contributions to the dynamics 
of rigid bodies, hydrodynamics, aerodynamics, the three-body problem, and the 
theory of vibrating strings. 
In order to motivate the discussion that will follow, we begin with two
familiar elementary examples. 

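In considering the rotation of a rigid body $\Omega \subset \mathbb{R}^3$, the inertia matrix is the $3 \times 3$ symmetric matrix

$$J = \begin{pmatrix} I_{xx} & -I_{xy} & -I_{xz} \\ -I_{yx} & I_{yy} & -I_{yz} \\ -I_{zx} & -I_{zy} & I_{zz} \end{pmatrix},$$

whose diagonal elements are the moments of inertia about the axes,

$$I_{xx} = \int_\Omega (y^2 + z^2)\, d\Omega, \qquad I_{yy} = \int_\Omega (x^2 + z^2)\, d\Omega, \qquad I_{zz} = \int_\Omega (x^2 + y^2)\, d\Omega,$$

and whose off-diagonal elements are defined by the corresponding products of inertia

$$I_{xy} = I_{yx} = \int_\Omega xy\, d\Omega, \qquad I_{yz} = I_{zy} = \int_\Omega yz\, d\Omega, \qquad I_{zx} = I_{xz} = \int_\Omega zx\, d\Omega.$$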


Then, the eigenvectors of the inertia matrix are the directions of the 
principal axes of inertia of the body, about which free steady rotation is 
possible, and the eigenvalues are the principal moments of inertia about 
these axes. 

A second example, which involves matrices of any order, arises in the solution of systems of linear ordinary differential equations of the form

$$\frac{d\mathbf{x}}{dt} = A\mathbf{x},$$

where $\mathbf{x}$ is a vector of $n$ elements, each of which is a function of the independent variable $t$, and $A$ is an $n \times n$ matrix whose elements are constants. If $A$ were a diagonal matrix, with diagonal elements $a_{ii} = \lambda_i$, $i = 1,2,\dots,n$, the solution of this system would be straightforward, as each of the equations could be solved separately, giving

$$x_i(t) = x_i(0)\exp(\lambda_i t), \qquad i = 1,2,\dots,n.$$

When $A$ is not a diagonal matrix, suppose that we can find a nonsingular matrix $M$ such that

$$M^{-1} A M = D,$$

where $D$ is a diagonal matrix. Then, on letting

$$\mathbf{y} = M^{-1}\mathbf{x},$$

we easily see that

$$\frac{d\mathbf{y}}{dt} = D\mathbf{y}.$$

The solution of this system of differential equations is straightforward, as we have just seen, and we then find that

$$x_i = (M\mathbf{y})_i = \sum_{j=1}^{n} M_{ij}\, y_j(0) \exp(\lambda_j t),$$

where $\lambda_j = d_{jj}$ is one of the diagonal elements of $D$. The numbers $\lambda_j$, $j = 1,2,\dots,n$, are the eigenvalues of the matrix $A \in \mathbb{R}^{n \times n}$, and the columns of $M$ are the eigenvectors of $A$, so the solution of this system of differential equations requires the calculation of the eigenvalues and eigenvectors of the matrix $A$.
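As a small illustration of the decoupling argument, the sketch below evaluates $\mathbf{x}(t) = M \exp(Dt) M^{-1}\mathbf{x}(0)$ for a matrix whose eigenvalues and eigenvectors are supplied by hand. The function name `solve_by_diagonalisation` and the $2 \times 2$ test matrix are assumptions chosen for illustration; they do not come from the text.

```python
import math


def solve_by_diagonalisation(M, M_inv, eigenvalues, x0, t):
    """Evaluate x(t) = M exp(D t) M^{-1} x(0), where D = diag(eigenvalues)."""
    n = len(x0)
    # Change variables: y(0) = M^{-1} x(0).
    y0 = [sum(M_inv[i][j] * x0[j] for j in range(n)) for i in range(n)]
    # Each decoupled equation y_j' = lambda_j y_j is solved separately.
    y = [y0[j] * math.exp(eigenvalues[j] * t) for j in range(n)]
    # Transform back: x = M y.
    return [sum(M[i][j] * y[j] for j in range(n)) for i in range(n)]


# Illustrative data: A = [[0, 1], [1, 0]] has eigenvalues 1 and -1,
# with eigenvectors (1, 1)^T and (1, -1)^T forming the columns of M.
M = [[1.0, 1.0], [1.0, -1.0]]
M_inv = [[0.5, 0.5], [0.5, -0.5]]
x = solve_by_diagonalisation(M, M_inv, [1.0, -1.0], [1.0, 0.0], 0.3)
print(x)
```

For this particular matrix and the initial value $(1,0)^{\mathrm T}$ the exact solution is $(\cosh t, \sinh t)^{\mathrm T}$, which gives an independent check on the computed vector.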

In systems of differential equations of this kind the matrix A is not 
necessarily symmetric. In that case, the problem is more difficult; if 
the eigenvalues of A are not distinct there may not exist a complete set 
of linearly independent eigenvectors, and then the matrix M will not 
exist. 

In this chapter, we shall develop numerical algorithms for the solution of the algebraic eigenvalue problem (5.1), assuming throughout that $A \in \mathbb{R}^{n \times n}$ is a symmetric matrix. As has been noted above, the analogous problem for a nonsymmetric matrix is more involved, and will not be considered here.²

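Throughout this chapter, the set of all real-valued symmetric matrices of order $n$ will be denoted by $\mathbb{R}^{n \times n}_{\mathrm{sym}}$; thus, given a matrix $A = (a_{ij})$,

$$A \in \mathbb{R}^{n \times n}_{\mathrm{sym}} \iff A \in \mathbb{R}^{n \times n} \ \ \&\ \ a_{ij} = a_{ji}, \quad i,j = 1,2,\dots,n.$$

We begin with a reminder of some fundamental properties.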


1 Consider, for example, 


This matrix has one eigenvalue of multiplicity 2, $\lambda_{1/2} = 1$, and only one (linearly independent) eigenvector, $(1,0)^{\mathrm T}$.

2 The reader is referred to the last four chapters of J.H. Wilkinson’s monograph, 
The Algebraic Eigenvalue Problem, The Clarendon Press, Oxford University Press, 
New York, 1988. 
Theorem 5.1 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$; then, the following statements are valid.

(i) There exist $n$ linearly independent eigenvectors $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and corresponding eigenvalues $\lambda_i \in \mathbb{R}$ such that $A\mathbf{x}^{(i)} = \lambda_i \mathbf{x}^{(i)}$ for all $i = 1,2,\dots,n$.

(ii) The function

$$\lambda \mapsto \det(A - \lambda I) \tag{5.2}$$

is a polynomial of degree $n$ with leading term $(-1)^n \lambda^n$, called the characteristic polynomial of $A$. The eigenvalues of $A$ are the zeros of the characteristic polynomial.

(iii) If the eigenvalues $\lambda_i$ and $\lambda_j$ of $A$ are distinct, then the corresponding eigenvectors $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ are orthogonal in $\mathbb{R}^n$, i.e.,

$$\mathbf{x}^{(i)\mathrm T}\mathbf{x}^{(j)} = 0 \quad \text{if } \lambda_i \neq \lambda_j, \qquad i,j \in \{1,2,\dots,n\}.$$

(iv) If $\lambda_i$ is a root of multiplicity $m$ of (5.2), then there is a linear subspace in $\mathbb{R}^n$ of dimension $m$, spanned by $m$ mutually orthogonal eigenvectors associated with the eigenvalue $\lambda_i$.

(v) Suppose that each of the eigenvectors $\mathbf{x}^{(i)}$ of $A$ is normalised, in other words, $\mathbf{x}^{(i)\mathrm T}\mathbf{x}^{(i)} = 1$ for $i = 1,2,\dots,n$, and let $X$ denote the square matrix whose columns are the normalised (orthogonal) eigenvectors; then, the matrix $\Lambda = X^{\mathrm T} A X$ is diagonal, and the diagonal elements of $\Lambda$ are the eigenvalues of $A$.

(vi) Let $Q \in \mathbb{R}^{n \times n}$ be an orthogonal matrix and define $B \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ by $B = Q^{\mathrm T} A Q$; then, $\det(B - \lambda I) = \det(A - \lambda I)$ for each $\lambda \in \mathbb{R}$. The eigenvalues of $B$ are the same as the eigenvalues of $A$, and the eigenvectors of $B$ are the vectors $Q^{\mathrm T}\mathbf{x}^{(i)}$, $i = 1,2,\dots,n$.

(vii) Any vector $\mathbf{v} \in \mathbb{R}^n$ can be expressed as a linear combination of the (ortho)normalised eigenvectors $\mathbf{x}^{(i)}$, $i = 1,2,\dots,n$, of $A$, i.e.,

$$\mathbf{v} = \sum_{i=1}^{n} \alpha_i \mathbf{x}^{(i)}, \qquad \alpha_i = \mathbf{x}^{(i)\mathrm T}\mathbf{v}.$$

(viii) The trace of $A$, $\mathrm{Trace}(A) = \sum_{i=1}^{n} a_{ii}$, is equal to the sum of the eigenvalues of $A$.


These properties should be familiar; proofs will be found in any standard text on linear algebra.¹


1 See, for example, T.S. Blyth and E.F. Robertson, Basic Linear Algebra, Springer 
Undergraduate Mathematics Series, Springer, 1998, A.G. Hamilton, Linear Alge- 
bra, Cambridge University Press, 1990, or R.A. Horn and C.R. Johnson, Matrix 
Analysis, Cambridge University Press, 1992. 
5.2 The characteristic polynomial


Given that $A \in \mathbb{R}^{n \times n}$ and $n < 4$, it is quite easy to write down the characteristic polynomial $\det(A - \lambda I)$ by expanding the determinant, and
then find the roots of this polynomial of degree n in order to determine 
the eigenvalues of A. If n > 4 there is no general closed formula for 
the roots of a polynomial in terms of its coefficients, and therefore we 
have to resort to a numerical technique. A further difficulty is that 
the roots may be very sensitive to small changes in the coefficients of 
the polynomial, and we find that the effect of rounding errors in the 
construction of the characteristic polynomial is usually catastrophic. 


Example 5.1 Consider, for example, the diagonal matrix of order 16 whose diagonal elements are $j + \tfrac{1}{3}$, $j = 1,2,\dots,16$; the eigenvalues are, of course, just the diagonal elements. Constructing the characteristic polynomial, working with 10 significant digits throughout, gives the result

$$\lambda^{16} - 141.3333333\,\lambda^{15} + 9193.333333\,\lambda^{14} - \cdots.$$

Using a standard numerical algorithm (such as Newton’s method) for computing the roots of the polynomial and working with 10 significant digits gives the smallest root as 1.333333331, which is nearly correct to 10 significant digits. The three largest roots, however, are computed as, approximately, $15.5 \pm 1.3\imath$ and 16.7, which are very different from their true values 14.3, 15.3, 16.3, respectively, even though the matrix in this example is of quite modest size, and the eigenvalues are well spaced. Thus we conclude from this example that the numerical method which constructs the characteristic polynomial and finds its roots is completely unsatisfactory for general use, except for matrices of very small size. □


The fact that in general the roots of the characteristic polynomial 
cannot be given in closed form shows that any method must proceed 
by successive approximation. Although one cannot expect to produce 
the required eigenvalues exactly in a finite number of steps, we shall see 
that there exist rapidly convergent iterative methods for computing the 
eigenvalues and eigenvectors numerically. 


5.3 Jacobi’s method 


This method uses a succession of orthogonal transformations to produce a sequence of matrices which approaches a diagonal matrix in the limit. Each step in the process involves a matrix representing a plane rotation. We begin with a simple example.


Example 5.2 (The plane rotation matrix in $\mathbb{R}^2$) Let us suppose that $\varphi \in [-\pi, \pi]$ and consider the matrix $R(\varphi) \in \mathbb{R}^{2 \times 2}$ defined by

$$R(\varphi) = \begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}.$$

For a vector $\mathbf{x} \in \mathbb{R}^2$, $R(\varphi)\mathbf{x}$ is the plane rotation of $\mathbf{x}$ around the origin by an angle $\varphi$ (in the clockwise direction when $\varphi > 0$ and in the anticlockwise direction when $\varphi < 0$).

We note in passing that since $\cos(-\varphi) = \cos\varphi$, $\sin(-\varphi) = -\sin\varphi$ and $\cos^2\varphi + \sin^2\varphi = 1$, we have that

$$(R(\varphi))^{\mathrm T} = R(-\varphi) \quad \text{and} \quad R(\varphi)\,R(-\varphi) = I.$$

Hence $R(\varphi)$ is an orthogonal matrix; i.e.,

$$R(\varphi)\,R(\varphi)^{\mathrm T} = R(\varphi)^{\mathrm T} R(\varphi) = I,$$

where $I$ is the $2 \times 2$ identity matrix.

The next definition extends the notion of plane rotation matrix to $\mathbb{R}^n$.

Definition 5.2 (The plane rotation matrix in $\mathbb{R}^n$) Suppose that $n \ge 2$, $1 \le p < q \le n$ and $\varphi \in [-\pi, \pi]$. We consider the matrix $R^{(pq)}(\varphi) \in \mathbb{R}^{n \times n}$ whose elements are the same as those of the identity matrix $I \in \mathbb{R}^{n \times n}$, except for the four elements

$$r_{pp} = c, \quad r_{pq} = s, \quad r_{qp} = -s, \quad r_{qq} = c,$$

where $c = \cos\varphi$, $s = \sin\varphi$.

As in Example 5.2, it is a straightforward matter to show that

$$(R^{(pq)}(\varphi))^{\mathrm T} = R^{(pq)}(-\varphi), \qquad R^{(pq)}(\varphi)\,R^{(pq)}(-\varphi) = I,$$

and that, therefore,

$$R^{(pq)}(\varphi)\,(R^{(pq)}(\varphi))^{\mathrm T} = (R^{(pq)}(\varphi))^{\mathrm T} R^{(pq)}(\varphi) = I.$$

Hence $R^{(pq)}(\varphi) \in \mathbb{R}^{n \times n}$ is an orthogonal matrix for any $p, q$ such that $1 \le p < q \le n$, and any $\varphi \in [-\pi, \pi]$.

The basic result underlying Jacobi’s method is encapsulated in the next theorem.
Theorem 5.2 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$. For each pair of integers $(p,q)$ with $1 \le p < q \le n$, there exists $\varphi \in [-\pi/4, \pi/4]$ such that the $(p,q)$ entry of the symmetric matrix $R^{(pq)}(\varphi)^{\mathrm T} A\, R^{(pq)}(\varphi)$ is equal to 0.


Proof For the sake of notational simplicity, we shall write $R$ instead of $R^{(pq)}(\varphi)$ throughout the proof, and abbreviate $c = \cos\varphi$ and $s = \sin\varphi$. Consider the product $A' = AR$. Evidently the only difference between $A'$ and $A$ is in columns $p$ and $q$; these columns of $A'$ are linear combinations of the same two columns of $A$:

$$\left.\begin{aligned} a'_{ip} &= a_{ip}c - a_{iq}s \\ a'_{iq} &= a_{ip}s + a_{iq}c \end{aligned}\right\} \qquad i = 1,2,\dots,n. \tag{5.3}$$

Multiplication of $A'$ by $R^{\mathrm T}$ on the left gives a similar result, but affects rows $p$ and $q$, rather than columns $p$ and $q$. Writing $B = R^{\mathrm T} A'$ gives

$$\left.\begin{aligned} b_{pj} &= a'_{pj}c - a'_{qj}s \\ b_{qj} &= a'_{pj}s + a'_{qj}c \end{aligned}\right\} \qquad j = 1,2,\dots,n. \tag{5.4}$$

Combining these equations shows that $B = R^{\mathrm T} A R$, where

$$\begin{aligned} b_{pp} &= a_{pp}c^2 - 2a_{pq}sc + a_{qq}s^2, \\ b_{qq} &= a_{pp}s^2 + 2a_{pq}sc + a_{qq}c^2, \\ b_{pq} &= (a_{pp} - a_{qq})sc + a_{pq}(c^2 - s^2) = b_{qp}. \end{aligned} \tag{5.5}$$

The remaining elements of $B = R^{\mathrm T} A R$ in columns $p$ and $q$ are given by the expressions

$$\left.\begin{aligned} b_{ip} &= a_{ip}c - a_{iq}s \\ b_{iq} &= a_{ip}s + a_{iq}c \end{aligned}\right\} \qquad i = 1,2,\dots,n, \quad i \neq p, q.$$

The matrix $B = R^{\mathrm T} A R$ is evidently symmetric, so the nondiagonal elements of $B$ in rows $p$ and $q$ are also given by the same expressions.

Finally, we note that all the elements of $B$ which do not lie either in row $p$ or $q$ or in column $p$ or $q$ are the same as the corresponding elements of $A$, that is,

$$b_{ij} = a_{ij}, \qquad \text{if } i \neq p, q \text{ and } j \neq p, q.$$

We see from (5.5) that in order to ensure that $b_{pq}$, the $(p,q)$-entry of the matrix $B = R^{\mathrm T} A R$, is equal to 0, it suffices to choose $\varphi$ such that

$$\tan 2\varphi = \frac{2a_{pq}}{a_{qq} - a_{pp}}; \tag{5.6}$$

thus we select

$$\varphi = \frac{1}{2}\tan^{-1}\frac{2a_{pq}}{a_{qq} - a_{pp}} \in [-\pi/4, \pi/4]. \tag{5.7}$$

To see this, apply the trigonometric identities $c^2 - s^2 = \cos(2\varphi)$ and $sc = \tfrac{1}{2}\sin(2\varphi)$ to $b_{pq}$ in (5.5), with $b_{pq} = 0$. That completes the proof.¹


We can avoid the trigonometric calculations involved in the formula (5.7) for $\varphi$ by writing $t = s/c$, and seeing that $t$ is required to satisfy

$$(a_{pp} - a_{qq})\,t + a_{pq}(1 - t^2) = 0. \tag{5.8}$$

If $a_{pq} = 0$, we can ensure that (5.8) holds by selecting $t = 0$ (which corresponds to choosing $\varphi = 0$). If $a_{pq} \neq 0$ and $a_{pp} = a_{qq}$, we put $t = 1$ (corresponding to $\varphi = \pi/4$). Finally, if $a_{pq} \neq 0$ and $a_{pp} \neq a_{qq}$, we solve the quadratic equation (5.8); there will be two distinct real roots, so we define $t$ as the one that is smaller in absolute value. Having selected $t$, we then use the relation $\sec^2\varphi = 1 + \tan^2\varphi$ to calculate $c$ by $c = 1/(1 + t^2)^{1/2}$, and then $s$ from $s = ct$.
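The three cases for choosing $t$ translate directly into code. The following is a minimal sketch under stated assumptions: the function name `rotation_parameters` is hypothetical, and the closed form used for the smaller root is one convenient way of solving (5.8), obtained by writing it as $t^2 - 2\beta t - 1 = 0$ with $\beta = (a_{pp} - a_{qq})/(2a_{pq})$. Since the product of the two roots is $-1$, the root of smaller absolute value is $-\mathrm{sign}(\beta)/(|\beta| + \sqrt{\beta^2 + 1})$.

```python
import math


def rotation_parameters(app, aqq, apq):
    """Return (c, s), where t = s/c satisfies (app - aqq) t + apq (1 - t^2) = 0."""
    if apq == 0.0:
        t = 0.0                       # nothing to annihilate: phi = 0
    elif app == aqq:
        t = 1.0                       # phi = pi/4
    else:
        beta = (app - aqq) / (2.0 * apq)
        # Smaller root of t^2 - 2 beta t - 1 = 0, written so that no
        # subtraction of nearly equal numbers occurs.
        t = -math.copysign(1.0, beta) / (abs(beta) + math.sqrt(beta * beta + 1.0))
    c = 1.0 / math.sqrt(1.0 + t * t)  # from sec^2(phi) = 1 + tan^2(phi)
    s = c * t
    return c, s


c, s = rotation_parameters(2.0, 1.0, 0.5)
# The (p, q) entry of R^T A R from (5.5); it should vanish up to rounding.
print((2.0 - 1.0) * s * c + 0.5 * (c * c - s * s))
```

Because $|t| \le 1$ in every case, the resulting angle lies in $[-\pi/4, \pi/4]$, in agreement with Theorem 5.2.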


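Definition 5.3 (The classical Jacobi method) Let $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and define $A^{(0)} = A$. Given $k \ge 0$ and $A^{(k)} \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, the basic step of Jacobi’s method computes $A^{(k+1)} \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ by first locating the largest in absolute value off-diagonal element $(A^{(k)})_{pq} = a^{(k)}_{pq}$ of the matrix $A^{(k)}$ and then setting $A^{(k+1)} = R^{(pq)}(\varphi_k)^{\mathrm T} A^{(k)} R^{(pq)}(\varphi_k)$ with $\varphi_k$ chosen so as to reduce $(A^{(k+1)})_{pq}$ to zero. This process is then repeated until all the off-diagonal elements are smaller than a given positive tolerance $\varepsilon$.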
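A compact version of the classical Jacobi method might look as follows. This is a sketch under stated assumptions, not the authors' code: the names `off_diagonal_sum`, `jacobi_step` and `jacobi_eigenvalues`, the default tolerance and the step limit are all hypothetical. Matrices are plain lists of lists with $n \ge 2$, and indices are 0-based, so the pair $(p,q)$ of the text corresponds to `(p - 1, q - 1)` here.

```python
import math


def off_diagonal_sum(a):
    """Sum of squares of the off-diagonal entries of a."""
    n = len(a)
    return sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)


def jacobi_step(a):
    """One basic step of the classical Jacobi method; returns B = R^T A R."""
    n = len(a)
    # Locate the off-diagonal element of largest absolute value.
    p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: abs(a[ij[0]][ij[1]]))
    apq = a[p][q]
    if apq == 0.0:
        return [row[:] for row in a]      # already diagonal
    if a[p][p] == a[q][q]:
        t = 1.0
    else:
        beta = (a[p][p] - a[q][q]) / (2.0 * apq)
        t = -math.copysign(1.0, beta) / (abs(beta) + math.sqrt(beta * beta + 1.0))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    b = [row[:] for row in a]
    for i in range(n):                    # A' = A R changes columns p and q, cf. (5.3)
        aip, aiq = b[i][p], b[i][q]
        b[i][p] = aip * c - aiq * s
        b[i][q] = aip * s + aiq * c
    for j in range(n):                    # B = R^T A' changes rows p and q, cf. (5.4)
        apj, aqj = b[p][j], b[q][j]
        b[p][j] = apj * c - aqj * s
        b[q][j] = apj * s + aqj * c
    b[p][q] = b[q][p] = 0.0               # zero by construction; drop rounding residue
    return b


def jacobi_eigenvalues(a, tol=1e-12, max_steps=10000):
    """Repeat the basic step until every off-diagonal entry is below tol."""
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(max_steps):
        if all(abs(a[i][j]) < tol for i in range(n) for j in range(n) if i != j):
            break
        a = jacobi_step(a)
    return sorted(a[i][i] for i in range(n))


T = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
print(jacobi_eigenvalues(T))
```

The test matrix `T` is the familiar second-difference matrix, whose eigenvalues are $2 - \sqrt{2}$, $2$ and $2 + \sqrt{2}$, so the printed values can be checked by hand. Searching the whole matrix for the largest off-diagonal element costs $O(n^2)$ comparisons per step, which this sketch makes no attempt to reduce.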


In order to show that as $k \to \infty$ the sequence of matrices $(A^{(k)})$ generated by successive steps of the classical Jacobi method converges to a diagonal matrix (whose diagonal entries are the eigenvalues of the original matrix $A$), we need the following result.


Lemma 5.1 The sum of squares of the elements of a symmetric matrix is invariant under an orthogonal transformation: that is, if $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and $B = R^{\mathrm T} A R$ where $R \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, then

$$\sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2. \tag{5.9}$$

¹ For future reference, note that a simple calculation based on (5.5) and (5.6) gives

$$b_{ii} - a_{ii} = \begin{cases} 0 & \text{if } i \neq p, q, \\ -a_{pq}\tan\varphi & \text{if } i = p, \\ a_{pq}\tan\varphi & \text{if } i = q. \end{cases}$$

The quantity

$$\|A\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2\right)^{1/2}$$

is called the Frobenius norm¹ of $A \in \mathbb{R}^{n \times n}$. The Frobenius norm of $A \in \mathbb{R}^{n \times n}$ is the 2-norm of $A$, with $A$ regarded as an element of a linear space of dimension $n^2$ over the field of real numbers; however, it is not a subordinate norm in the sense of Definition 2.10. In particular, the Frobenius norm on $\mathbb{R}^{n \times n}$ is not subordinate to the 2-norm on $\mathbb{R}^n$.

Now, one can express (5.9) equivalently by saying that the Frobenius norm of a symmetric matrix $A$ is invariant under an orthogonal transformation: $\|R^{\mathrm T} A R\|_F = \|A\|_F$.


Proof of lemma The sum of squares of the elements of $A$ is the same as the trace of $A^2$, for

$$\mathrm{Trace}(A^2) = \sum_{i=1}^{n} (A^2)_{ii} = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}a_{ji} = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2, \tag{5.10}$$

since $A$ is symmetric. Analogously, as $B = R^{\mathrm T} A R$ is symmetric, we have that

$$\mathrm{Trace}(B^2) = \sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}^2.$$

Thus, it remains to show that $\mathrm{Trace}(B^2) = \mathrm{Trace}(A^2)$. Now,

$$B^2 = (R^{\mathrm T} A R)(R^{\mathrm T} A R) = R^{\mathrm T} A^2 R, \tag{5.11}$$

since $R$ is orthogonal. Hence $B^2$ is an orthogonal transformation of $A^2$ which, by virtue of Theorem 5.1 (vi), means that $B^2$ and $A^2$ have the same eigenvalues, and therefore the same trace, since the trace is the sum of the eigenvalues (see Theorem 5.1 (viii)).
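The invariance stated in Lemma 5.1 is easy to observe numerically. The sketch below is an illustration only: the helper names (`matmul`, `transpose`, `sum_of_squares`, `plane_rotation`), the $3 \times 3$ test matrix and the angle 0.7 are arbitrary assumptions, and the indices `p`, `q` are 0-based.

```python
import math


def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(col) for col in zip(*x)]


def sum_of_squares(x):
    """Square of the Frobenius norm."""
    return sum(v * v for row in x for v in row)


def plane_rotation(n, p, q, phi):
    """Identity matrix with the four entries of Definition 5.2 replaced."""
    r = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    c, s = math.cos(phi), math.sin(phi)
    r[p][p], r[p][q], r[q][p], r[q][q] = c, s, -s, c
    return r


a = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 1.0]]
r = plane_rotation(3, 0, 2, 0.7)
b = matmul(transpose(r), matmul(a, r))       # B = R^T A R
trace_a2 = sum(matmul(a, a)[i][i] for i in range(3))
print(sum_of_squares(a), sum_of_squares(b), trace_a2)
```

All three printed numbers should agree up to rounding error, reflecting both (5.9) and (5.10).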


1 Ferdinand Georg Frobenius (26 October 1849, Berlin-Charlottenburg, Prussia, 
Germany — 3 August 1917, Berlin, Germany), contributed to the theory of an- 
alytic functions, representation theory of groups, differential equation theory and 
the theory of elliptic functions. 
Now we are ready to embark on the convergence analysis of the
classical Jacobi method.

Theorem 5.3 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, $n \geq 2$. In the classical Jacobi
method the off-diagonal entries in the sequence of matrices $(A^{(k)})$,
generated from $A^{(0)} = A$ according to Definition 5.3, converge to 0 in the
sense that

$$ \lim_{k \to \infty} \sum_{\substack{i,j=1 \\ i \neq j}}^{n} [(A^{(k)})_{ij}]^2 = 0 . \tag{5.12} $$

Furthermore,

$$ \lim_{k \to \infty} \sum_{i=1}^{n} [(A^{(k)})_{ii}]^2 = \mathrm{Trace}(A^2) . \tag{5.13} $$


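Proof Let $a_{pq}$ be the off-diagonal element of $A$ with largest absolute
value, and let $B = (R^{(pq)}(\varphi))^T A R^{(pq)}(\varphi)$, where $\varphi$ is defined by (5.7).
Then, letting $c = \cos\varphi$ and $s = \sin\varphi$, we have that

$$ \begin{pmatrix} b_{pp} & b_{pq} \\ b_{qp} & b_{qq} \end{pmatrix} = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}^{T} \begin{pmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{pmatrix} \begin{pmatrix} c & s \\ -s & c \end{pmatrix} , $$

and Lemma 5.1 implies that

$$ b_{pp}^2 + 2 b_{pq}^2 + b_{qq}^2 = a_{pp}^2 + 2 a_{pq}^2 + a_{qq}^2 . $$

Writing

$$ S(A) = \sum_{i,j=1}^{n} a_{ij}^2 , \qquad D(A) = \sum_{i=1}^{n} a_{ii}^2 , \qquad L(A) = \sum_{\substack{i,j=1 \\ i \neq j}}^{n} a_{ij}^2 , $$

it follows that $S(A) = D(A) + L(A)$. Now $S(B) = S(A)$ by Lemma 5.1,
and so $D(B) + L(B) = D(A) + L(A)$. The diagonal entries of $B$ are the
same as those of $A$, except the ones in rows $p$ and $q$, $1 \leq p < q \leq n$.
Further, as $b_{pq} = 0$, it follows that $b_{pp}^2 + b_{qq}^2 = a_{pp}^2 + 2 a_{pq}^2 + a_{qq}^2$. Therefore,

$$ D(B) = D(A) + 2 a_{pq}^2 . $$

Consequently,

$$ L(B) = L(A) - 2 a_{pq}^2 . $$

Now $a_{pq}$ is the largest off-diagonal element of $A$; hence $L(A) \leq N a_{pq}^2$,
where $N = n(n-1)$ is the number of off-diagonal elements, and therefore

$$ L(B) \leq (1 - 2/N) L(A) . \tag{5.14} $$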
On writing $A^{(0)} = A$, $A^{(1)} = B$, and generating subsequent members
of the sequence $(A^{(k)})$ in a similar manner, as indicated in the algorithm
in Definition 5.3, we deduce from (5.14) that

$$ 0 \leq L(A^{(k)}) \leq (1 - 2/N)^k L(A) , \qquad k = 1, 2, 3, \ldots, \tag{5.15} $$

where $N \geq 2$. Thus we conclude that $\lim_{k \to \infty} L(A^{(k)}) = 0$.
Now, (5.13) follows from (5.10) and (5.12) on noting that

$$ \mathrm{Trace}(A^2) = S(A) = S(A^{(k)}) = D(A^{(k)}) + L(A^{(k)}) \qquad \forall k \geq 0 , $$

and passing to the limit $k \to \infty$: $\mathrm{Trace}(A^2) = \lim_{k \to \infty} D(A^{(k)})$.
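
The contraction expressed by (5.14) and (5.15) can be observed numerically. The following Python sketch is an illustration of ours, not the algorithm of Definition 5.3 verbatim: the function names and the 3 x 3 test matrix are hypothetical, and the rotation angle is taken in $[-\pi/4, \pi/4]$ with $\tan 2\varphi = 2a_{pq}/(a_{qq} - a_{pp})$, which we assume agrees with (5.7). This choice makes $b_{pq} = 0$ for the rotation block displayed in the proof above.

```python
import math

def rotation_angle(app, aqq, apq):
    """Angle phi in [-pi/4, pi/4] with tan(2 phi) = 2 a_pq / (a_qq - a_pp) (assumed form of (5.7))."""
    if app == aqq:
        return math.copysign(math.pi / 4.0, apq)
    return 0.5 * math.atan(2.0 * apq / (aqq - app))

def rotate(A, p, q):
    """Return B = R^T A R, where R is the plane rotation in the (p, q)-plane making b_pq = 0."""
    phi = rotation_angle(A[p][p], A[q][q], A[p][q])
    c, s = math.cos(phi), math.sin(phi)
    B = [row[:] for row in A]
    for i in range(len(B)):            # columns p and q of A R
        aip, aiq = B[i][p], B[i][q]
        B[i][p], B[i][q] = c * aip - s * aiq, s * aip + c * aiq
    for j in range(len(B)):            # rows p and q of R^T (A R)
        apj, aqj = B[p][j], B[q][j]
        B[p][j], B[q][j] = c * apj - s * aqj, s * apj + c * aqj
    return B

def off_diagonal_sum(A):
    """L(A): the sum of squares of the off-diagonal entries of A."""
    n = len(A)
    return sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def classical_jacobi(A, steps):
    """Apply at most `steps` classical Jacobi rotations; return A^(k) and [L(A^(0)), L(A^(1)), ...]."""
    n = len(A)
    history = [off_diagonal_sum(A)]
    for _ in range(steps):
        # locate the off-diagonal entry of largest absolute value, with p < q
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        if A[p][q] == 0.0:
            break                      # the matrix is already diagonal
        A = rotate(A, p, q)
        history.append(off_diagonal_sum(A))
    return A, history

# Arbitrary symmetric test matrix (an assumption, chosen only for illustration).
A0 = [[2.0, 1.0, 0.5], [1.0, 3.0, -1.0], [0.5, -1.0, -1.0]]
Ak, L = classical_jacobi(A0, 20)
N = len(A0) * (len(A0) - 1)
for k in range(1, len(L)):
    # computed L(A^(k)) next to the bound (1 - 2/N)^k L(A) from (5.15)
    print(k, L[k], (1.0 - 2.0 / N) ** k * L[0])
```

In such experiments the computed values of $L(A^{(k)})$ typically fall well below the bound, which is only a worst-case estimate.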


According to Theorem 5.1 (viii) the trace of $A^2$ is the sum of the
eigenvalues of $A^2$, and the eigenvalues of $A^2$ are the squares of the
eigenvalues of $A$. Thus, we have shown that the sum of the squares of the
diagonal elements in the sequence of matrices $(A^{(k)})$ generated by the
classical Jacobi method converges to the sum of the squares of the
eigenvalues of $A$. More work is required to show that for each $i = 1, 2, \ldots, n$ the
sequence of diagonal elements $(a_{ii}^{(k)})$ converges to an eigenvalue of $A$ as
$k \to \infty$. We shall further discuss this question in the final paragraphs
of Section 5.4. First, however, we describe another variant of Jacobi’s
method.


Definition 5.4 (The serial Jacobi method) This version of Jacobi’s
method proceeds in a systematic order, using transformations $R^{(pq)}(\varphi)$
to reduce to zero the elements $(1,2)$, $(1,3)$, ..., $(1,n)$, $(2,3)$, $(2,4)$, ...,
$(2,n)$, ..., $(n-1,n)$ in this order. The complete step is then repeated
iteratively.


It is not difficult to prove that this method also converges. Both 
these variants of the Jacobi method converge quite rapidly; the rate of 
convergence is in practice much faster than is suggested by (5.15), and 
in fact it can be shown that convergence is ultimately quadratic. 

It is time for an example! 


Example 5.3 Let us consider the 5 x 5 matrix

$$ A = \begin{pmatrix} 4 & 1 & 2 & 1 & 2 \\ 1 & 3 & 0 & -3 & 4 \\ 2 & 0 & 1 & 2 & 2 \\ 1 & -3 & 2 & 4 & 1 \\ 2 & 4 & 2 & 1 & 1 \end{pmatrix} . \tag{5.16} $$
The values of $D(A^{(k)})$ and $L(A^{(k)})$ after each iteration of the serial
Jacobi method, with $A^{(0)} = A$, are shown in Table 5.1. The off-diagonal
elements of the third iterate, $A^{(3)}$, are zero to 10 decimal digits. The
diagonal elements of $A^{(3)}$, which give the eigenvalues, are

8.094, 1.690, −0.671, 7.170, −3.282.

Note that the eigenvalues do not appear in any particular order.


Table 5.1. Convergence of the serial Jacobi iteration.

  k     D(A^(k))     L(A^(k))

  0      43.000     88.00000000
  1     126.309      4.69087885
  2     130.981      0.01948855
  3     131.000      0.00000000


This concludes the discussion about the use of Jacobi’s method for 
computing the eigenvalues of a symmetric matrix A. ‘Fine,’ you might 
say, ‘but how do we determine the eigenvectors of A?’ 

It turns out that by collecting the information accumulated in the
course of the Jacobi iteration, it is fairly easy to calculate the
eigenvectors of $A$. We begin by noting that if $M$ is an orthogonal matrix such
that $M^T A M = D$, where $D$ is diagonal, then the diagonal elements of
$D$ are the eigenvalues of $A$, and the columns of $M$ are the corresponding
eigenvectors of $A$.

In the course of the Jacobi iteration (be it classical or serial), we
have constructed the plane rotations $R^{(p_j q_j)}(\varphi_j)$, $j = 1, 2, \ldots, k$. Thus,
an approximation $M^{(k)}$ to the orthogonal matrix $M$ can be obtained
by considering the product of these rotation matrices: initially, we put
$M^{(0)} = I$ and then we apply the column transformation $R^{(p_j q_j)}(\varphi_j)$ at
each step $j = 1, 2, \ldots, k$. This corresponds to multiplying $M^{(j-1)}$ on
the right by $R^{(p_j q_j)}(\varphi_j)$ for $j = 1, 2, \ldots, k$, and leads to the orthogonal
matrix

$$ M^{(k)} = R^{(p_1 q_1)}(\varphi_1) \cdots R^{(p_k q_k)}(\varphi_k) $$

which represents the required approximation to the orthogonal matrix
$M$. The columns of $M^{(k)}$ will be the desired approximate eigenvectors
of $A$ corresponding to the approximate eigenvalues which appear along
the diagonal of $A^{(k)}$.
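
The accumulation of the rotations into $M^{(k)}$ can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name `jacobi_eigen` is ours, the sweeps follow the serial ordering of Definition 5.4, the angle is chosen in $[-\pi/4, \pi/4]$ with $\tan 2\varphi = 2a_{pq}/(a_{qq} - a_{pp})$ (assumed to agree with (5.7)), and the matrix is our reading of (5.16).

```python
import math

def jacobi_eigen(A, sweeps=10):
    """Serial Jacobi sweeps on a symmetric matrix A.

    Returns (diagonal entries of A^(k), M^(k)), where M^(k) is the product of all
    rotations applied, so that (M^(k))^T A M^(k) = A^(k) is nearly diagonal.
    """
    n = len(A)
    A = [row[:] for row in A]                       # work on a copy
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # M^(0) = I
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue                        # nothing to annihilate
                if A[p][p] == A[q][q]:
                    phi = math.copysign(math.pi / 4.0, A[p][q])
                else:
                    phi = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(phi), math.sin(phi)
                for X in (A, M):                    # column transformation: X <- X R
                    for i in range(n):
                        xp, xq = X[i][p], X[i][q]
                        X[i][p], X[i][q] = c * xp - s * xq, s * xp + c * xq
                for j in range(n):                  # row transformation on A only: A <- R^T A
                    ap, aq = A[p][j], A[q][j]
                    A[p][j], A[q][j] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], M

# The matrix of Example 5.3, as we read (5.16); treat the entries as an assumption.
A = [[4.0, 1.0, 2.0, 1.0, 2.0],
     [1.0, 3.0, 0.0, -3.0, 4.0],
     [2.0, 0.0, 1.0, 2.0, 2.0],
     [1.0, -3.0, 2.0, 4.0, 1.0],
     [2.0, 4.0, 2.0, 1.0, 1.0]]
eigenvalues, M = jacobi_eigen(A)
for j, lam in enumerate(eigenvalues):
    column = [M[i][j] for i in range(len(A))]       # approximate eigenvector for lam
    print(lam, column)
```

Each printed column $m_j$ of $M^{(k)}$ should satisfy $A m_j \approx \lambda_j m_j$, where $\lambda_j$ is the diagonal entry printed next to it.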

The Jacobi method usually converges in a reasonable number of itera- 
tions, and is a satisfactory method for small or moderate-sized matrices. 
However, there are many problems, particularly in the area of numeri- 
cal solution of partial differential equations, which give rise to very large 
matrices that are sparse, with most of the elements being zero. A further 
consideration is that in many practical situations one does not need to 
compute all the eigenvalues. It is much more common to require a few of 
the largest eigenvalues and corresponding eigenvectors, or perhaps a few 
of the smallest. Jacobi’s method is not suitable for such problems, as 
it always produces all the eigenvalues, and will not preserve the sparse 
structure of a matrix during the course of the iteration. For example, it 
is easy to see that if Jacobi’s method is applied to a symmetric tridiago- 
nal matrix, then at the end of one sweep all (but two) of the elements of 
the matrix will in general be nonzero and, although still symmetric, the 
transformed matrix is no longer tridiagonal. Later on in this chapter we 
shall consider numerical algorithms for computing selected eigenvalues 
of a matrix. Thus, as an overture to what will follow, we now outline a 
‘rough and ready’ technique for locating the eigenvalues. 


5.4 The Gerschgorin theorems 

Gerschgorin’s Theorem$^1$ provides a very simple way of determining a
region that contains the eigenvalues of a matrix. It is very general, and
does not assume that the matrix is symmetric; in fact we shall allow
the elements of a square matrix of order $n$ to be complex and write
$A \in \mathbb{C}^{n \times n}$ to express this fact.

Definition 5.5 Suppose that $n \geq 2$ and $A \in \mathbb{C}^{n \times n}$. The Gerschgorin
discs $D_i$, $i = 1, 2, \ldots, n$, of the matrix $A$ are defined as the closed
circular regions

$$ D_i = \{ z \in \mathbb{C} : |z - a_{ii}| \leq R_i \} \tag{5.17} $$

in the complex plane, where

$$ R_i = \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}| \tag{5.18} $$

is the radius of $D_i$.


1 After S.A. Gerschgorin; see the historical survey of Seiji Fujino and Joachim
Fischer, Über S.A. Gerschgorin (1901-1933) [German: About S.A. Gershgorin
(1901-1933)], GAMM Mitt. Ges. Angew. Math. Mech. 21, no. 1, 15-19, 1998.
Theorem 5.4 (Gerschgorin’s Theorem) Let $n \geq 2$ and $A \in \mathbb{C}^{n \times n}$.
All eigenvalues of the matrix $A$ lie in the region $D = \bigcup_{i=1}^{n} D_i$, where $D_i$,
$i = 1, 2, \ldots, n$, are the Gerschgorin discs of $A$ defined by (5.17), (5.18).


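Proof Suppose that $\lambda \in \mathbb{C}$ and $x \in \mathbb{C}^n \setminus \{0\}$ are an eigenvalue and the
corresponding eigenvector of $A$, so that

$$ \sum_{j=1}^{n} a_{ij} x_j = \lambda x_i , \qquad i = 1, 2, \ldots, n . \tag{5.19} $$

Suppose that $x_k$, with $k \in \{1, 2, \ldots, n\}$, is the component of $x$ which
has largest modulus, or one of those components if more than one have
the same modulus. We note in passing that $x_k \neq 0$, given that $x \neq 0$;
also,

$$ |x_j| \leq |x_k| \qquad \text{for all } j \in \{1, 2, \ldots, n\} . \tag{5.20} $$

This means that

$$ |\lambda - a_{kk}| \, |x_k| = |\lambda x_k - a_{kk} x_k| = \Big| \sum_{j=1}^{n} a_{kj} x_j - a_{kk} x_k \Big| = \Big| \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj} x_j \Big| \leq |x_k| R_k , \tag{5.21} $$

which, on division by $|x_k|$, shows that $\lambda$ lies in the Gerschgorin disc $D_k$
of radius $R_k$ centred at $a_{kk}$. Hence, $\lambda \in D = \bigcup_{i=1}^{n} D_i$.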


Theorem 5.5 (Gerschgorin’s Second Theorem) Let $n \geq 2$. Suppose
that $1 \leq p \leq n-1$ and that the Gerschgorin discs of the matrix
$A \in \mathbb{C}^{n \times n}$ can be divided into two disjoint subsets $D^{(p)}$ and $D^{(q)}$,
containing $p$ and $q = n - p$ discs respectively. Then, the union of the discs
in $D^{(p)}$ contains $p$ of the eigenvalues, and the union of the discs in $D^{(q)}$
contains $n - p$ eigenvalues. In particular, if one disc is disjoint from
all the others, it contains exactly one eigenvalue, and if all the discs are
disjoint then each disc contains exactly one eigenvalue.


Proof We shall use a so-called homotopy (or continuation) argument.
For $0 \leq \varepsilon \leq 1$, we consider the matrix $B(\varepsilon) = (b_{ij}(\varepsilon)) \in \mathbb{C}^{n \times n}$, where

$$ b_{ij}(\varepsilon) = \begin{cases} a_{ii} & \text{if } i = j , \\ \varepsilon a_{ij} & \text{if } i \neq j . \end{cases} \tag{5.22} $$

Then, $B(1) = A$, and $B(0)$ is the diagonal matrix whose diagonal
elements coincide with those of $A$. Each of the eigenvalues of $B(0)$ is
therefore the centre of one of the Gerschgorin discs of $A$; thus exactly $p$
of the eigenvalues of $B(0)$ lie in the union of the discs in $D^{(p)}$. Now, the
eigenvalues of $B(\varepsilon)$ are the zeros of its characteristic polynomial, which
is a polynomial whose coefficients are continuous functions of $\varepsilon$; hence
the zeros of this polynomial are also continuous functions of $\varepsilon$. Thus as
$\varepsilon$ increases from 0 to 1 the eigenvalues of $B(\varepsilon)$ move along continuous
paths in the complex plane, and at the same time the radii of the
Gerschgorin discs increase from 0 to the radii of the Gerschgorin discs of
$A$. Since $p$ of the eigenvalues lie in the union of the discs in $D^{(p)}$ when
$\varepsilon = 0$, and these discs are disjoint from all of the discs in $D^{(q)}$, these $p$
eigenvalues must still lie in the union of the discs in $D^{(p)}$ when $\varepsilon = 1$,
and the theorem is proved.

The same proof evidently still applies when the discs can be divided 
into any number of disjoint subsets. 


Example 5.4 Consider the matrix

$$ A = \begin{pmatrix} 4.00 & 0.20 & -0.10 & 0.10 \\ 0.20 & -1.00 & -0.10 & 0.05 \\ -0.10 & -0.10 & 3.00 & 0.10 \\ 0.10 & 0.05 & 0.10 & -3.00 \end{pmatrix} . \tag{5.23} $$


Figure 5.1 shows, as solid circles, the Gerschgorin discs for this matrix; 
for instance, one of the discs has centre at 4.00 and radius 0.40. The 
discs are clearly disjoint, so that each disc contains one eigenvalue of 
the matrix. The significance of the dotted circles will be explained in our 
next example. 
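
The centres and radii in Definition 5.5 are simple to compute. The short Python sketch below, with hypothetical helper names `gerschgorin_discs` and `discs_disjoint`, evaluates (5.17) and (5.18) for the matrix (5.23) and tests whether the closed discs are pairwise disjoint; it is meant only as an illustration of the definitions.

```python
def gerschgorin_discs(A):
    """Return the (centre, radius) pairs of the Gerschgorin discs of a square matrix A."""
    n = len(A)
    return [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]

def discs_disjoint(discs):
    """True if no two of the closed discs intersect."""
    return all(abs(ci - cj) > ri + rj
               for k, (ci, ri) in enumerate(discs)
               for (cj, rj) in discs[k + 1:])

# The matrix (5.23) of Example 5.4.
A = [[4.00, 0.20, -0.10, 0.10],
     [0.20, -1.00, -0.10, 0.05],
     [-0.10, -0.10, 3.00, 0.10],
     [0.10, 0.05, 0.10, -3.00]]
discs = gerschgorin_discs(A)
for centre, radius in discs:
    print(centre, radius)
print(discs_disjoint(discs))
```

Because `abs` also accepts complex numbers, the same functions apply unchanged to matrices with complex entries.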


Example 5.5 Let us consider the matrix $A$ defined by (5.23), and then
transform it into $B = K A K^{-1}$, where $K \in \mathbb{R}^{4 \times 4}$ is the same as the
identity matrix except that $k_{22} = \kappa > 0$.


This transformation has the effect of multiplying the elements in row 2
by $\kappa$, and multiplying the elements in column 2 by $1/\kappa$; the diagonal
element $a_{22}$ thus remains unaltered. A small value of $\kappa$ then means that
the second disc of $B$ is smaller than the second disc of $A$, but the other
discs grow larger. The dotted discs in Figure 5.1 are for the matrix $B$
with $\kappa = 1/23$. For this value the other discs are still just disjoint from
the disc centred at −1.00; the disc with centre at 4.00 almost touches
the disc with centre at −1.00. The disc with centre −1.00 has radius
0.014, and is too small to be visible in the figure. The eigenvalue in this
disc is −1.009 to three decimal digits. The same procedure can be used
to reduce the size of each of the discs in turn. $\diamond$


Fig. 5.1. Gerschgorin discs in the complex plane for the matrix $A$ defined in
(5.23) (solid circles) and for $B = K A K^{-1}$ (dotted circles). The numbers along
the real axis denote the first coordinate of the centre point of each circle (the
second coordinate being zero in each case).


This idea is formalised in the next theorem. 


Theorem 5.6 Let $n \geq 2$, and suppose that in the matrix $A \in \mathbb{C}^{n \times n}$ all
the off-diagonal elements are smaller in absolute value than $\varepsilon$, so that
$|a_{ij}| < \varepsilon$, for all $i, j \in \{1, 2, \ldots, n\}$ with $i \neq j$. Suppose also that for
a particular integer $r \in \{1, 2, \ldots, n\}$ the diagonal element $a_{rr}$ is distant
$\delta$ from all the other diagonal elements, so that $|a_{rr} - a_{ii}| > \delta$, for all $i$
such that $i \neq r$. Then, provided that

$$ \varepsilon < \frac{\delta}{2(n-1)} , \tag{5.24} $$

there is an eigenvalue $\lambda$ of $A$ such that

$$ |\lambda - a_{rr}| < 2(n-1)\varepsilon^2/\delta . \tag{5.25} $$


Proof We apply the similarity transformation

$$ A \in \mathbb{C}^{n \times n} \mapsto A' = K A K^{-1} \in \mathbb{C}^{n \times n} , $$

where $K \in \mathbb{R}^{n \times n}$ is the same as the identity matrix, except that the
diagonal element in row $r$ is chosen to be $k_{rr} = \kappa > 0$. This has the
effect of multiplying the off-diagonal elements of row $r$ by $\kappa$, and the
element in column $r$ of row $i$, where $i \neq r$, by $1/\kappa$. The Gerschgorin
disc from row $r$ then has centre $a_{rr}$ and radius not exceeding $\kappa(n-1)\varepsilon$,
and the disc corresponding to row $i \neq r$ has centre $a_{ii}$ and radius not
exceeding $(n-2)\varepsilon + \varepsilon/\kappa$.

We now want to reduce the size of disc $r$ by choosing a small value of
$\kappa$, while keeping it disjoint from the rest. This is easily done by choosing
$\kappa = 2\varepsilon/\delta$. The radius of disc $r$ does not exceed $2(n-1)\varepsilon^2/\delta$, and the
radius of disc $i \neq r$ does not exceed $(n-2)\varepsilon + \frac{1}{2}\delta$. The sum of these
radii therefore satisfies

$$ R_r + R_i \leq 2(n-1)\varepsilon^2/\delta + (n-2)\varepsilon + \tfrac{1}{2}\delta < \varepsilon + (n-2)\varepsilon + \tfrac{1}{2}\delta < \delta , \tag{5.26} $$

where we have used the given condition (5.24) twice. As the centres $a_{rr}$
and $a_{ii}$ of these discs are distant more than $\delta$ from each other, (5.26)
shows that the two discs are disjoint, and the required result is proved.


Theorem 5.6 is sufficient to show that for a matrix satisfying its
hypotheses we can find a Gerschgorin disc whose radius is of order $\varepsilon^2$
provided that $\varepsilon$ is sufficiently small. It also indicates that the spacing
between the diagonal elements is important.

In particular, Theorem 5.6 applies to the matrix $A^{(k)}$ which results
after $k$ iterations of the Jacobi method. If at that stage all the off-diagonal
elements have magnitude less than $\varepsilon$ then there is one eigenvalue in each
of the intervals $[a_{ii}^{(k)} - (n-1)\varepsilon, \, a_{ii}^{(k)} + (n-1)\varepsilon]$, provided that these
intervals are disjoint; this follows from Theorem 5.5. If $\varepsilon$ is sufficiently small
compared with the distances between the diagonal elements of $A^{(k)}$,
Theorem 5.6 may be used to give closer bounds on the eigenvalues.
We close this section with some comments on the convergence of the
classical Jacobi iteration. According to the Cauchy–Schwarz inequality,

$$ \Big( \max_{1 \leq i \leq n} \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}^{(k)}| \Big)^2 \leq \Big( \sum_{\substack{i,j=1 \\ i \neq j}}^{n} |a_{ij}^{(k)}| \Big)^2 \leq n(n-1) \sum_{\substack{i,j=1 \\ i \neq j}}^{n} |a_{ij}^{(k)}|^2 , $$

so (5.12) implies that

$$ \lim_{k \to \infty} \max_{1 \leq i \leq n} \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}^{(k)}| = 0 . $$

In other words, the radii of the Gerschgorin discs for the matrices in the
sequence $(A^{(k)})$ converge to 0 as $k \to \infty$. As $A^{(k)}$ and $A$ have identical
eigenvalues for all $k$, it follows from Theorems 5.4 and 5.5 that the set
of limiting diagonal entries $\{\lim_{k \to \infty} a_{11}^{(k)}, \ldots, \lim_{k \to \infty} a_{nn}^{(k)}\}$ delivered by
the Jacobi iteration is equal to the set of eigenvalues of $A$. This holds
irrespective of the spacing between the diagonal entries.


5.5 Householder’s method 


The general method for finding the eigenvalues of a real symmetric ma- 
trix begins by applying an orthogonal transformation to reduce it to a 
tridiagonal matrix. This can be done in a finite number of steps by using 
Householder matrices. 


Definition 5.6 Given a vector $v \in \mathbb{R}^n_* = \mathbb{R}^n \setminus \{0\}$, the corresponding Householder
matrix $H = H(v)$ of order $n$ is defined by

$$ H = I - \frac{2}{v^T v} \, v v^T , $$

where $I$ is the identity matrix of order $n$.


Clearly, for any vector $x \in \mathbb{R}^n$, we have

$$ Hx = x - \frac{2 \, v^T x}{v^T v} \, v , $$

and hence the vectors $Hx$, $x$ and $v$ are coplanar. In particular, if $x \in \mathbb{R}^n$
and $v^T x = 0$ then $Hx = x$, and therefore the $(n-1)$-dimensional
hyperplane $\mathcal{H}$ consisting of all vectors $x$ that are perpendicular to $v$ in
$\mathbb{R}^n$ is invariant under the mapping $x \mapsto Hx$. Finally, for any $x \in \mathbb{R}^n$,

$$ v^T H x = -v^T x . $$

Hence, if the angle between $x$ and $v$ is denoted by $\varphi$, then the angle
between $v$ and $Hx$ is equal to $\pi + \varphi$. We conclude from these observations
that the vector $Hx$ is the reflection of $x$ in the hyperplane $\mathcal{H}$. For this
reason, the mapping $x \mapsto Hx$ is frequently referred to as a Householder
reflector, corresponding to the vector $v \in \mathbb{R}^n_*$ (see Figure 5.2).


Fig. 5.2. Action of the Householder reflector $H: x \mapsto Hx$, corresponding to
$v \in \mathbb{R}^n_*$, on a vector $x \in \mathbb{R}^n$. $Hx$ is the reflection of $x$ in the hyperplane $\mathcal{H}$
perpendicular to $v$.


Lemma 5.2 Every Householder matrix is symmetric and orthogonal. 


Proof As $I^T = I$, $(v v^T)^T = (v^T)^T v^T = v v^T$, and $v^T v$ is a (positive
real) number, the symmetry of $H$ follows. The orthogonality of $H$ is a
consequence of the identity

$$ H^T H = H H^T = H^2 = I - \frac{4}{v^T v} \, v v^T + \frac{4}{(v^T v)^2} \, (v v^T)(v v^T) = I , $$

since $(v v^T)(v v^T) = v (v^T v) v^T = (v^T v) \, v v^T$ by the associativity of
matrix multiplication.
Lemma 5.3 Let $1 \leq k < n$ and suppose that $H_k$ is a $k \times k$ Householder
matrix. Then, the matrix $H \in \mathbb{R}^{n \times n}$, written in partitioned form as

$$ H = \begin{pmatrix} I_{n-k} & 0 \\ 0^T & H_k \end{pmatrix} , $$

where $I_{n-k}$ is the identity matrix of order $n-k$ and $0$ is the $(n-k) \times k$
zero matrix, is also a Householder matrix.


The proof of this lemma is straightforward and is left as an exercise.
(See Exercise 1.)


Lemma 5.4 Given any vector $x \in \mathbb{R}^n_*$, there exists a Householder matrix
$H \in \mathbb{R}^{n \times n}$ such that all elements of the vector $Hx$ are zero, except the
first; i.e., $Hx$ is a nonzero multiple of $e_1$, the first column of the identity
matrix.


In geometrical terms this result can be rephrased by saying that for
any vector $x \in \mathbb{R}^n_*$ there exists an $(n-1)$-dimensional hyperplane $\mathcal{H}$
passing through the origin in $\mathbb{R}^n$ such that the reflection $Hx$ of $x$ in $\mathcal{H}$ is
equal to a nonzero multiple of $e_1$. To find $\mathcal{H}$ it suffices to identify a vector
$v \in \mathbb{R}^n_*$ normal to $\mathcal{H}$. Since $H$ is unaffected by rescaling $v$ (see Definition
5.6), the length of $v$ is immaterial. As noted in the discussion following
Definition 5.6, the vectors $Hx$, $x$ and $v$ are coplanar. Therefore, we
shall seek $v \in \mathbb{R}^n_*$ as a suitable linear combination of $x$ and $e_1$.


Proof of lemma We seek $H = I - [2/(v^T v)] \, v v^T$ with $v = x + c e_1$,
where $c$ is a nonzero real number to be determined. Hence,

$$ v^T x = x^T x + c\beta , \qquad v^T v = x^T x + 2c\beta + c^2 , $$

where $\beta = e_1^T x$ is the first entry of $x$. A simple manipulation then shows
that

$$ Hx = x - \frac{2}{v^T v} \, v (v^T x) = \frac{(c^2 - x^T x) \, x - 2c (x^T x + c\beta) \, e_1}{x^T x + 2c\beta + c^2} . $$

Thus, $Hx$ will be a multiple of $e_1$ provided that we choose $c$ so that
$c^2 = x^T x$. Also, to avoid division by 0, we need to ensure that
$x^T x + 2c\beta + c^2 \neq 0$. To do so, note that $c^2 = x^T x \geq \beta^2$; therefore

$$ x^T x + 2c\beta + c^2 \geq (\beta + c)^2 \neq 0 , $$

provided that $\beta + c \neq 0$, which can be ensured by selecting the
appropriate sign for $c$, that is, by defining

$$ c = \begin{cases} (\mathrm{sign}\, \beta) \sqrt{x^T x} & \text{when } \beta \neq 0 , \\ \sqrt{x^T x} & \text{when } \beta = 0 . \end{cases} $$

With this choice of $c$, we have $Hx = -c e_1$, as required.
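
The construction in the proof of Lemma 5.4 translates directly into code. The following Python sketch uses hypothetical function names of our own choosing; it forms $v = x + c e_1$ with the sign convention above and applies $H$ through the formula $Hx = x - (2 v^T x / v^T v) v$, without ever building the matrix $H$. The test vector $(1, 2, 1, 2)^T$ is chosen for illustration; with it, $v$ should agree with the nonzero part of the vector quoted in (5.28).

```python
import math

def householder_vector(x):
    """Return (v, c) with v = x + c e_1 and c = (sign x_1) ||x||_2, as in the proof of Lemma 5.4."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        raise ValueError("x must be a nonzero vector")
    c = norm if x[0] >= 0.0 else -norm     # sign(0) is taken as +1
    v = list(x)
    v[0] += c
    return v, c

def apply_householder(v, x):
    """Compute Hx = x - (2 v^T x / v^T v) v without forming H explicitly."""
    factor = 2.0 * sum(a * b for a, b in zip(v, x)) / sum(a * a for a in v)
    return [xi - factor * vi for xi, vi in zip(x, v)]

x = [1.0, 2.0, 1.0, 2.0]
v, c = householder_vector(x)
print(v)                          # v = x + c e_1
print(apply_householder(v, x))    # should be -c e_1 up to rounding
```

Choosing the sign of $c$ to match that of $\beta$ means that $\beta$ and $c$ are added rather than subtracted when $v$ is formed, so no cancellation occurs in the first component.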


We now show how Householder matrices can be used to reduce a given 
matrix to tridiagonal form. 


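Theorem 5.7 Given that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and $n \geq 3$, there exists a matrix
$Q_n \in \mathbb{R}^{n \times n}$, a product of $n-2$ Householder matrices $H_{(n,k)} \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$,
$k = 2, \ldots, n-1$, given by

$$ Q_n = H_{(n,n-1)} H_{(n,n-2)} \cdots H_{(n,2)} $$

such that $Q_n^T A Q_n = T_n$ is tridiagonal; the matrix $Q_n$ is orthogonal.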


Proof The proof of the theorem will proceed by induction. Before em- 
barking on this, we make some preparatory observations which highlight 
the key ideas in the proof. 

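Consider the matrix $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, partitioned by its first row and column
in the form

$$ A = \begin{pmatrix} \alpha & b^T \\ b & C \end{pmatrix} , $$

where $\alpha \in \mathbb{R}$, $b \in \mathbb{R}^{n-1}$ and $C \in \mathbb{R}^{(n-1) \times (n-1)}_{\mathrm{sym}}$, and define

$$ \mathcal{E}^n_1 = \{ v \in \mathbb{R}^n : v = (\lambda, 0, \ldots, 0)^T \text{ for some } \lambda \in \mathbb{R} \} . $$

If $b$ happens to belong to $\mathbb{R}^{n-1}_*$, then, by Lemma 5.4, there exists an
$(n-1) \times (n-1)$ Householder matrix $H_{n-1}$ such that each element of
$H_{n-1} b$, except the first, is equal to 0. If, on the other hand, $b = 0$, then
$H_{n-1} b = 0$, trivially. Either way, $H_{n-1} b \in \mathcal{E}^{n-1}_1$.

Let us extend the Householder matrix $H_{n-1} \in \mathbb{R}^{(n-1) \times (n-1)}_{\mathrm{sym}}$, using
Lemma 5.3 with $k = n-1$, to a Householder matrix $H_{(n,n-1)} \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$
by defining the $(1,1)$-entry of $H_{(n,n-1)}$ as 1 and choosing the remaining
entries in the first row and first column of $H_{(n,n-1)}$ as 0. Then,

$$ H_{(n,n-1)}^T A H_{(n,n-1)} = \begin{pmatrix} 1 & 0^T \\ 0 & H_{n-1}^T \end{pmatrix} \begin{pmatrix} \alpha & b^T \\ b & C \end{pmatrix} \begin{pmatrix} 1 & 0^T \\ 0 & H_{n-1} \end{pmatrix} = \begin{pmatrix} \alpha & d^T \\ d & D \end{pmatrix} , \tag{5.27} $$

where

$$ d = H_{n-1}^T b = H_{n-1} b \in \mathcal{E}^{n-1}_1 \qquad \text{and} \qquad D = H_{n-1}^T C H_{n-1} \in \mathbb{R}^{(n-1) \times (n-1)}_{\mathrm{sym}} . $$

As $d \in \mathcal{E}^{n-1}_1$, the first row and first column of $H_{(n,n-1)}^T A H_{(n,n-1)}$ are of
the desired form. It remains to transform the submatrix $D$ to tridiagonal
form. This will be achieved by proceeding inductively.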

If $n = 3$, then the 3 x 3 matrix $H_{(3,2)}^T A H_{(3,2)}$ is automatically
tridiagonal since $d \in \mathcal{E}^2_1$, and we complete the proof by taking
$Q_3 = H_{(3,2)}$. We note in passing that if $f \in \mathcal{E}^3_1$, then

$$ Q_3^T f = H_{(3,2)}^T f = H_{(3,2)} f \in \mathcal{E}^3_1 , $$

as the $(1,1)$-entry of $H_{(3,2)}$ is 1 and the remaining entries in its first
column are all 0.

Let us suppose that $n \geq 4$ and $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$. Our inductive hypothesis
is that the statement of the theorem has already been established for
any real symmetric matrix of order $n-1$, i.e., $D \in \mathbb{R}^{(n-1) \times (n-1)}_{\mathrm{sym}}$ can be
transformed into tridiagonal form:

$$ Q_{n-1}^T D Q_{n-1} = T_{n-1} , $$

where $Q_{n-1} \in \mathbb{R}^{(n-1) \times (n-1)}$ is an orthogonal matrix that is a product
of $n-3$ Householder matrices, each of size $(n-1) \times (n-1)$,

$$ Q_{n-1} = H_{(n-1,n-2)} \cdots H_{(n-1,2)} , $$

and $Q_{n-1}^T f \in \mathcal{E}^{n-1}_1$ for any vector $f \in \mathcal{E}^{n-1}_1$. This inductive hypothesis
has already been verified above for 3 x 3 real symmetric matrices.

We now extend each of the $(n-1) \times (n-1)$ matrices $H_{(n-1,k)}$, for
$k = 2, \ldots, n-2$, to $n \times n$ Householder matrices $H_{(n,k)}$, $k = 2, \ldots, n-2$,
respectively, as in Lemma 5.3, and define

$$ Q_n = H_{(n,n-1)} H_{(n,n-2)} \cdots H_{(n,2)} . $$

Then, by (5.27),

$$ Q_n^T A Q_n = \begin{pmatrix} 1 & 0^T \\ 0 & Q_{n-1}^T \end{pmatrix} \begin{pmatrix} \alpha & d^T \\ d & D \end{pmatrix} \begin{pmatrix} 1 & 0^T \\ 0 & Q_{n-1} \end{pmatrix} = \begin{pmatrix} \alpha & (Q_{n-1}^T d)^T \\ Q_{n-1}^T d & T_{n-1} \end{pmatrix} . $$

As $d \in \mathcal{E}^{n-1}_1$, it follows from our inductive hypothesis that $Q_{n-1}^T d$ also
belongs to $\mathcal{E}^{n-1}_1$, and therefore the last matrix is tridiagonal. As $Q_n$ is
a product of $n-2$ Householder matrices, each of size $n \times n$ and each
orthogonal, $Q_n \in \mathbb{R}^{n \times n}$ is itself orthogonal. Moreover, for any $f \in \mathcal{E}^n_1$
we have $Q_n^T f \in \mathcal{E}^n_1$, since the $(1,1)$-entry of $Q_n$ is 1 and the remaining
entries in the first column of $Q_n$ are 0. This concludes the inductive
step, and completes the proof.


The recursive transformation of a symmetric matrix to tridiagonal
form outlined in the proof of Theorem 5.7 is called Householder’s
method. In implementing this method in practice it is important to
carry out the transformations efficiently. Counting the arithmetic
operations involved is straightforward but tedious, and shows that the
complete reduction requires approximately $\frac{2}{3} n^3$ multiplications, for a
moderately large value of $n$.


Example 5.6 In order to illustrate Householder’s method, we return
to the matrix $A$ defined in (5.16). The first stage uses the Householder
matrix defined by the vector

$$ v = (0.000, 4.162, 2.000, 1.000, 2.000)^T . \tag{5.28} $$

The result of the transformation is the matrix

$$ \begin{pmatrix} 4.000 & -3.162 & 0.000 & 0.000 & 0.000 \\ -3.162 & 5.300 & 1.232 & -0.332 & 0.284 \\ 0.000 & 1.232 & 1.653 & 3.312 & 0.275 \\ 0.000 & -0.332 & 3.312 & 5.149 & 1.123 \\ 0.000 & 0.284 & 0.275 & 1.123 & -3.102 \end{pmatrix} . $$

The leading element of the matrix is unchanged, and the first row and
column have tridiagonal structure.
The second stage uses the Householder matrix with the vector

$$ v = (0.000, 0.000, 2.540, -0.332, 0.284)^T \tag{5.29} $$

and gives the new matrix

$$ \begin{pmatrix} 4.000 & -3.162 & 0.000 & 0.000 & 0.000 \\ -3.162 & 5.300 & -1.308 & 0.000 & 0.000 \\ 0.000 & -1.308 & 0.057 & -2.166 & 0.792 \\ 0.000 & 0.000 & -2.166 & 6.610 & 0.420 \\ 0.000 & 0.000 & 0.792 & 0.420 & -2.967 \end{pmatrix} . $$

This time the leading 2 x 2 minor is unaltered, and the first two rows
and columns have tridiagonal structure.
The final stage uses the Householder matrix with vector

$$ v = (0.000, 0.000, 0.000, -4.471, 0.792)^T \tag{5.30} $$

and gives the tridiagonal matrix

$$ \begin{pmatrix} 4.000 & -3.162 & 0.000 & 0.000 & 0.000 \\ -3.162 & 5.300 & -1.308 & 0.000 & 0.000 \\ 0.000 & -1.308 & 0.057 & 2.306 & 0.000 \\ 0.000 & 0.000 & 2.306 & 5.208 & -3.411 \\ 0.000 & 0.000 & 0.000 & -3.411 & -1.565 \end{pmatrix} . \tag{5.31} $$

The numerical values are quoted here to three decimal digits, for
simplicity. $\diamond$
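
The stages of such a reduction can be sketched in Python as follows. This is an illustration under stated assumptions, not a tuned implementation: the function name is ours, each Householder vector is built as $v = x + c e_1$ following the proof of Lemma 5.4, and the transformation $H T H$ is applied as a symmetric rank-two update rather than by forming $H$. The matrix is our reading of (5.16), to be treated as an assumption.

```python
import math

def householder_tridiagonalise(A):
    """Reduce a symmetric matrix to tridiagonal form using n - 2 Householder transformations."""
    n = len(A)
    T = [row[:] for row in A]                 # work on a copy
    for k in range(n - 2):
        x = [T[i][k] for i in range(k + 1, n)]    # column k below the diagonal
        if all(t == 0.0 for t in x[1:]):
            continue                          # this column already has the required form
        norm = math.sqrt(sum(t * t for t in x))
        c = norm if x[0] >= 0.0 else -norm
        v = [0.0] * n
        v[k + 1:] = x
        v[k + 1] += c                         # v = x + c e_1, padded with leading zeros
        beta = 2.0 / sum(t * t for t in v)    # H = I - beta v v^T
        w = [sum(T[i][j] * v[j] for j in range(n)) for i in range(n)]   # w = T v
        gamma = beta * beta * sum(vi * wi for vi, wi in zip(v, w))
        # H T H = T - beta (v w^T + w v^T) + beta^2 (v^T w) v v^T, since T is symmetric
        for i in range(n):
            for j in range(n):
                T[i][j] += gamma * v[i] * v[j] - beta * (v[i] * w[j] + w[i] * v[j])
    return T

# The matrix of Example 5.3, as we read (5.16); treat the entries as an assumption.
A = [[4.0, 1.0, 2.0, 1.0, 2.0],
     [1.0, 3.0, 0.0, -3.0, 4.0],
     [2.0, 0.0, 1.0, 2.0, 2.0],
     [1.0, -3.0, 2.0, 4.0, 1.0],
     [2.0, 4.0, 2.0, 1.0, 1.0]]
T = householder_tridiagonalise(A)
for row in T:
    print(" ".join("%8.3f" % t for t in row))
```

Entries outside the three central diagonals come out as rounding-level quantities rather than exact zeros; a practical code would set them to zero and store only the diagonal and one off-diagonal.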


Having shown how to transform a symmetric matrix into tridiagonal 
form, we can now consider the problem of determining the eigenvalues 
of a tridiagonal matrix. 


5.6 Eigenvalues of a tridiagonal matrix 


Before developing a numerical algorithm for calculating the eigenvalues
and the eigenvectors of a symmetric tridiagonal matrix, let us spend
some time exploring the location of the eigenvalues. The main result of
this section is the so-called Sturm sequence property,$^1$ stated in Theorem
5.9, which enables us to specify the number of eigenvalues of a symmetric
tridiagonal matrix which exceed a given real number $\vartheta$. The proof of
the Sturm sequence property is based on Cauchy’s Interlace Theorem
which is of independent interest, and proving the latter is our first task.

To simplify the notation we now write the symmetric tridiagonal
matrix in the form

$$ T = \begin{pmatrix} a_1 & b_2 & & & \\ b_2 & a_2 & b_3 & & \\ & b_3 & a_3 & b_4 & \\ & & \ddots & \ddots & \ddots \\ & & & b_n & a_n \end{pmatrix} . $$


1 Jacques Charles François Sturm (22 September 1803, Geneva, Helvetia (now
Switzerland) – 18 December 1855, Paris, France). The results discussed here
are based on Sturm’s paper ‘Mémoire sur la résolution des équations numériques’,
published in Mémoires présentés par divers savants étrangers à l’Académie royale
des sciences, section Sc. math. phys., 6, 273-318, 1835, concerning the number
of roots of a polynomial in an interval. In 1826 Sturm made the first accurate
determination of the velocity of sound in water working with the Swiss engineer
Daniel Colladon. In 1840 Sturm succeeded Poisson in the chair of mechanics in
the Faculté des Sciences in Paris.
The determinants of the successive principal minors of a matrix of this
form can easily be calculated by recurrence. Defining $p_r(\lambda)$ to be the
determinant of the leading principal minor of order $r$ of $T - \lambda I$, we see
that

$$ p_1(\lambda) = a_1 - \lambda , \qquad p_2(\lambda) = (a_2 - \lambda)(a_1 - \lambda) - b_2^2 . $$

Expanding $p_r(\lambda)$ in terms of the elements of the last row, and then in
terms of the last column, we obtain the relation

$$ p_r(\lambda) = (a_r - \lambda) \, p_{r-1}(\lambda) - b_r^2 \, p_{r-2}(\lambda) , \qquad r = 2, 3, \ldots, n , $$

with the convention that

$$ p_0(\lambda) \equiv 1 . $$

In the rest of this section we shall assume that all the off-diagonal
elements $b_i$ are nonzero. For suppose that $b_k = 0$ for some $k$ in the
set $\{2, 3, \ldots, n\}$; then, the eigenvalues of the matrix $T$ comprise the
eigenvalues of the matrix consisting of the first $k-1$ rows and columns,
together with the eigenvalues of the matrix consisting of the last $n-k+1$
rows and columns. These two problems become separated and can be
treated independently; if several of the off-diagonal elements are zero,
the matrix can be partitioned into a number of smaller matrices which
can then be dealt with independently.


Theorem 5.8 (Cauchy’s Interlace Theorem) Let $n \geq 3$. The roots
of $p_r$ separate those of $p_{r+1}$, for $r = 1, 2, \ldots, n-1$; i.e., between two
consecutive roots of $p_{r+1}$ there is exactly one root of the polynomial $p_r$,
$r = 1, 2, \ldots, n-1$.


Proof The proof is by induction. It is trivial to show that the property
holds for $r = 1$: the two roots

$$ \tfrac{1}{2}(a_1 + a_2) \pm \tfrac{1}{2} \sqrt{(a_1 - a_2)^2 + 4 b_2^2} $$

of $p_2$ are separated by $a_1$, the only root of the linear polynomial $p_1$.
Suppose that the statement is true when $r = i-1$, $2 \leq i \leq n-1$,
so that the roots of $p_{i-1}$ separate those of $p_i$. On denoting by $\alpha$ and
$\beta$ two consecutive roots of $p_i$, the inductive hypothesis implies that $p_{i-1}$
has exactly one root between $\alpha$ and $\beta$, which means that $p_{i-1}(\alpha)$ and
$p_{i-1}(\beta)$ have opposite signs. Now,

$$ p_{i+1}(\lambda) = (a_{i+1} - \lambda) \, p_i(\lambda) - b_{i+1}^2 \, p_{i-1}(\lambda) , $$

so that, as $\alpha$ and $\beta$ are roots of $p_i$, it follows that $p_{i+1}(\alpha)$ and $p_{i+1}(\beta)$
also have opposite signs. Hence $p_{i+1}$ has at least one root between $\alpha$
and $\beta$. Choosing $\alpha$ and $\beta$ to be each pair of consecutive roots of $p_i$ in
turn we have therefore located $i-1$ roots of $p_{i+1}$.

Next choose $\alpha$ to be the algebraically smallest root of $p_i$. It is easy to
see that each of the polynomials $p_1, p_2, \ldots, p_n$ tends to $+\infty$ as $\lambda \to -\infty$.
By the inductive hypothesis, $p_{i-1}$ has no roots smaller than $\alpha$, so $p_{i-1}(\alpha)$
is positive; hence from the recurrence relation $p_{i+1}(\alpha)$ is negative, and
therefore $p_{i+1}$ must have a root smaller than $\alpha$. A similar argument
shows that $p_{i+1}$ has a root greater than the largest root of $p_i$, so that
we have located all the $i+1$ roots of $p_{i+1}$. There is exactly one root
of $p_i$ between each pair of consecutive roots of $p_{i+1}$, and the interlacing
property follows.


We have shown in particular that all the roots of each $p_r$ are distinct.
Moreover $p_i(\lambda)$ and $p_{i-1}(\lambda)$ cannot both vanish for the same $\lambda$, for if
this were to happen the recurrence relation would show that this value
of $\lambda$ is a root of $p_r$ for all values of $r \in \{0, 1, \ldots, n\}$; but $p_0$ evidently
never vanishes.


Theorem 5.9 (The Sturm sequence property) Let us suppose that
$\vartheta \in \mathbb{R}$ and consider the sequence $p_i(\vartheta)$, $i = 0, 1, \ldots, n$. The number of
agreements in sign between consecutive members of the sequence is the
same as the number of eigenvalues of the matrix $T$ which are strictly
greater than $\vartheta$.


Proof Given that $\lambda \in \mathbb{R}$ and $1 \leq j \leq n$, we write $s_j(\lambda)$ for the number
of agreements in sign in the sequence

$$ p_0(\lambda), \; p_1(\lambda), \; \ldots, \; p_j(\lambda) , $$

and $g_j(\lambda)$ for the number of roots of the polynomial $p_j$ which are strictly
greater than $\lambda$.

It is trivial to see that $s_1(\vartheta) = g_1(\vartheta)$. The proof now proceeds by
induction. Let us suppose that $2 \leq k \leq n$ and adopt the inductive
hypothesis that $s_{k-1}(\vartheta) = g_{k-1}(\vartheta)$; we shall prove that $s_k(\vartheta) = g_k(\vartheta)$.

Under our hypothesis, either $s_k(\vartheta) = s_{k-1}(\vartheta) + 1$, if $p_k(\vartheta)$ and $p_{k-1}(\vartheta)$
have the same sign, or $s_k(\vartheta) = s_{k-1}(\vartheta)$ if they have opposite sign.
Suppose that $\vartheta$ lies in the interval between the two consecutive roots $\alpha$ and
$\beta$ of $p_{k-1}$. Then, there is exactly one root of $p_k$ between $\alpha$ and $\beta$;
denote this root by $\gamma$. As we saw in the proof of the previous theorem
$p_k(\lambda)$ is positive when $\lambda$ is large and negative, and the sign of $p_k(\lambda)$ is
determined by the number of roots of $p_k$ which are less than $\lambda$. Hence
if $\vartheta < \gamma$ both $p_k$ and $p_{k-1}$ have the same number of roots less than $\vartheta$,
so that $p_k(\vartheta)$ and $p_{k-1}(\vartheta)$ have the same sign, and $s_k(\vartheta) = s_{k-1}(\vartheta) + 1$.
Also if $p_k$ and $p_{k-1}$ have the same number of roots less than $\vartheta$, then
$p_k$ must have one more root which is greater than $\vartheta$; this means that
$g_k(\vartheta) = g_{k-1}(\vartheta) + 1$. Hence $s_k(\vartheta) = g_k(\vartheta)$. A similar argument shows
that $s_k(\vartheta) = g_k(\vartheta)$ in the alternative situation where $\vartheta > \gamma$. It is also
a simple matter to modify the argument slightly for the cases where $\vartheta$
is less than the smallest root of $p_{k-1}$, or greater than the largest root of
$p_{k-1}$, and so the inductive step is complete.


The theorem and proof do not allow for any of the members of the
sequence being zero, in which case the sign becomes undefined. A more
careful analysis is tedious but not difficult; it shows that the theorem
still holds if we adopt the convention that when $p_i(\vartheta)$ is zero it is given
the same sign as $p_{i-1}(\vartheta)$. As we have already seen, two consecutive
members of the sequence cannot both be zero.

Our next example will illustrate the application of the Sturm sequence
property.


Example 5.7 Determine the second largest eigenvalue of the matria 


3 1 0 0 
1 -1 2 0O 

Al ee (5.32) 
O <0. a. “g 


If the eigenvalues are Aj, where A; > Az > A3 > Aa, we wish to find Xo. 
Now, it is easy to see from Theorem 5.5 that all the eigenvalues lie in 
the interval [—4, 4]. We take the midpoint of this interval, and evaluate 
the Sturm sequence with 7 = 0, giving 


$$ p_0(0) = 1, \quad p_1(0) = 3, \quad p_2(0) = -4, \quad p_3(0) = -16, \quad p_4(0) = -12. $$

In this sequence there are three agreements of sign:

$$ (1, 3), \quad (-4, -16) \quad \text{and} \quad (-16, -12). $$


Hence $s_4(0) = 3$, and the matrix has three eigenvalues greater than
0; this means that $\lambda_2$ must lie in the right-hand half of the interval
$[-4, 4]$, that is, in $[0, 4]$. We construct the Sturm sequence for $\vartheta = 2$,
the midpoint of the interval, giving

$$ p_0(2) = 1, \quad p_1(2) = 1, \quad p_2(2) = -4, \quad p_3(2) = -0, \quad p_4(2) = 4. $$

Notice that here $p_3(2)$ is zero, and is given the negative sign to agree
with $p_2(2)$. The number of agreements in sign here is two, so two of the
eigenvalues are greater than 2, and $\lambda_2$ must lie in $[2, 4]$, the right-hand
half of the interval $[0, 4]$. For $\vartheta = 3$ we obtain the sequence


$$ 1, \quad +0, \quad -1, \quad 2, \quad -3, $$

with only one agreement of sign, so this time $\lambda_2$ must lie in the left-hand
half $[2, 3]$ of the interval $[2, 4]$, and we repeat the process, taking
$\vartheta = \tfrac{5}{2}$, the midpoint of $[2, 3]$. This time the sequence is

$$ 1, \quad \tfrac{1}{2}, \quad -\tfrac{11}{4}, \quad \tfrac{17}{8}, \quad -\tfrac{7}{16}, $$

with one agreement in sign, showing that $\lambda_2 < 2.5$.
The process of bisection can be repeated as many times as required to
locate the eigenvalue to a given accuracy. After 13 stages we find that
$\lambda_2 = 2.450$ correct to three decimal digits. ◊


This method is very similar to the usual bisection process for finding
a solution of $f(x) = 0$, beginning with an interval $[a, b]$ such that $f(a)$
and $f(b)$ have opposite signs. A great advantage of the Sturm sequence
method is that it not only determines the eigenvalue, but also indicates
which eigenvalue it is. If we used the Jacobi method of Section 5.3 we
would have to determine all the eigenvalues, sort them into order, and
then choose the second largest eigenvalue as $\lambda_2$.

The Sturm sequence method will also determine how many eigenvalues
of a matrix lie in a given interval $(\alpha, \beta)$; all that we need is to construct
the Sturm sequences $(p_j(\alpha))_{j=0,1,\ldots,n}$ and $(p_j(\beta))_{j=0,1,\ldots,n}$; then, the
required number of eigenvalues is $s_n(\alpha) - s_n(\beta)$.
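
The bisection procedure of Example 5.7 and the interval count just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the text: the function names `sturm_agreements`, `kth_largest_eigenvalue` and `count_in_interval` are hypothetical, the matrix is supplied through its diagonal and off-diagonal entries, and a zero member of the sequence is given the sign of its predecessor, following the convention adopted above.

```python
def sturm_agreements(diag, off, theta):
    """Return s_n(theta): the number of sign agreements between consecutive
    members of the Sturm sequence p_0(theta), ..., p_n(theta).

    diag holds the diagonal entries a_1..a_n of a symmetric tridiagonal
    matrix and off holds its off-diagonal entries b_2..b_n.
    """
    p_prev, p_curr = 0.0, 1.0      # dummy p_{-1} and p_0 = 1
    sign_prev = 1                  # sign of p_0
    agreements = 0
    for j, a in enumerate(diag):
        b_sq = off[j - 1] ** 2 if j > 0 else 0.0
        # Recurrence p_j = (a_j - theta) p_{j-1} - b_j^2 p_{j-2}, used directly.
        p_prev, p_curr = p_curr, (a - theta) * p_curr - b_sq * p_prev
        # Convention: a zero value takes the sign of the previous member.
        if p_curr == 0:
            sign = sign_prev
        else:
            sign = 1 if p_curr > 0 else -1
        if sign == sign_prev:
            agreements += 1
        sign_prev = sign
    return agreements


def kth_largest_eigenvalue(diag, off, k, lo, hi, stages=40):
    """Bisect [lo, hi], assumed to contain every eigenvalue, for the k-th
    largest eigenvalue (k = 1 is the largest)."""
    for _ in range(stages):
        mid = 0.5 * (lo + hi)
        if sturm_agreements(diag, off, mid) >= k:
            lo = mid               # at least k eigenvalues exceed mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def count_in_interval(diag, off, alpha, beta):
    """Number of eigenvalues in (alpha, beta): s_n(alpha) - s_n(beta)."""
    return sturm_agreements(diag, off, alpha) - sturm_agreements(diag, off, beta)


# Data of Example 5.7: diagonal 3, -1, 1, 1 and off-diagonal 1, 2, 1
# (only the squares of the off-diagonal entries enter the recurrence).
diag, off = [3.0, -1.0, 1.0, 1.0], [1.0, 2.0, 1.0]
lambda_2 = kth_largest_eigenvalue(diag, off, 2, -4.0, 4.0)
```

With these data, `sturm_agreements` returns 3, 2, 1 and 1 at $\vartheta = 0, 2, 3$ and $\tfrac{5}{2}$, matching the counts found by hand in Example 5.7.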

It is very important to calculate the sequence $p_j(\vartheta)$ directly from the
recurrence relation. For instance, in Example 5.7, with $\vartheta = 2.445$ we
obtain

$$
\begin{aligned}
p_0(2.445) &= 1, \\
p_1(2.445) &= 3 - 2.445 = 0.555, \\
p_2(2.445) &= (-1 - 2.445) \times 0.555 - 1 \times 1 = -2.9120, \\
p_3(2.445) &= (1 - 2.445) \times (-2.9120) - 4 \times 0.555 = 1.9878, \\
p_4(2.445) &= (1 - 2.445) \times 1.9878 - 1 \times (-2.9120) = 0.0396.
\end{aligned}
$$

The alternative, to construct explicit forms for the polynomials $p_j(\lambda)$,
$j = 0, 1, \ldots, n$, and then evaluate $p_j(\vartheta)$ by inserting the value $\lambda = \vartheta$
into each of the polynomials $p_j(\lambda)$, will lead to the construction of the
explicit form of the characteristic polynomial of the matrix, which is
$p_n(\lambda)$, and we have already seen that this is affected disastrously by
rounding errors. The calculation by direct use of the recurrence relation
is perfectly satisfactory.


Example 5.8 As a second example, we return to the matrix $A$ in (5.16),
which has been transformed to the tridiagonal form (5.31), to determine
the largest eigenvalue.


Table 5.2. Bisection process for the largest eigenvalue. In the table $k$
denotes the iteration number, $\vartheta_k$ the $k$th iterate approximating the
unknown eigenvalue $\lambda_1$, and $s_5(\vartheta_k)$ signifies the number of sign
agreements in the Sturm sequence $p_0(\vartheta_k), \ldots, p_5(\vartheta_k)$.

     k     ϑ_k     s_5(ϑ_k)
     0    0.000       3
     1    5.463       2
     2    8.194       0
     3    6.829       2
     4    7.511       1
     5    7.853       1
     6    8.024       1
     7    8.109       0
     8    8.066       1
     9    8.088       1
    10    8.098       0
    11    8.093       1
    12    8.096       0
    13    8.094       0
    14    8.094

Table 5.2 shows the result of the bisection process, using the Sturm
sequence. The $\infty$-norm of the tridiagonal matrix is 10.926, so the process
begins with the interval $[-10.926, 10.926]$.¹ The largest eigenvalue
is 8.094, to three decimal digits, agreeing with the result of Jacobi's
method, in Section 5.3. This table also shows how some savings are
possible when all the eigenvalues are required. We see from the table
that use of $\vartheta = 7.511$ gives 1 agreement in sign, while $\vartheta = 6.829$ gives 2
agreements in sign. The bisection process for the second largest eigenvalue
can therefore begin with the interval $[6.829, 7.511]$. ◊

¹ To explain this choice, let us note that if $\lambda \in \mathbb{C}$ is an eigenvalue of
$A \in \mathbb{C}^{n \times n}$ and $x \in \mathbb{C}^n \setminus \{0\}$ is the corresponding eigenvector, then
$|\lambda|\,\|x\| = \|\lambda x\| = \|Ax\| \le \|A\|\,\|x\|$; i.e., $|\lambda| \le \|A\|$, for any subordinate
matrix norm $\|\cdot\|$ and any eigenvalue $\lambda$ of $A$.


The method of bisection may appear rather crude, but it has the great
advantage of guaranteed success, and is very little affected by rounding
errors. Moreover, the amount of work involved is not large. If we have
calculated the squares of the off-diagonal entries, $b_j^2$, of the matrix $T$ in
advance, each computation of all members of the sequence requires about
$2n$ multiplications. If the bisection process is continued for 40 stages,
the eigenvalue will be determined to about nine significant digits, and
if we require to calculate $m$ of the eigenvalues to this accuracy, we shall
need about $80mn$ multiplications. If $m$ is a good deal smaller than $n$,
the order of the matrix, this is likely to be a great deal smaller than the
work involved in the process of reduction to tridiagonal form, which, as
we have seen, is about $\tfrac{2}{3}n^3$ multiplications. In most practical problems
it is the initial Householder reduction to tridiagonal form which accounts
for most of the computational work.


5.7 The QR algorithm 


In this section we discuss briefly the QR algorithm, an alternative method
for determining the eigenvalues of a tridiagonal matrix. In principle it
could be applied to a full matrix, but it is more efficient to use the
Householder method to reduce the matrix to tridiagonal form first. The
basis of the method is the QR factorisation of the matrix which we
have already encountered in Chapter 2, in the solution of least squares
problems. In contrast with Section 2.9, however, where we were concerned
with the solution of least squares problems for rectangular matrices
$A \in \mathbb{R}^{m \times n}$, here the focus is on eigenvalue problems for symmetric
tridiagonal matrices $A \in \mathbb{R}^{n \times n}$; we shall therefore revisit the derivation
of the QR factorisation by adopting a slightly different approach from
the one proposed in Section 2.9.


5.7.1 The QR factorisation revisited 


Suppose that $n \ge 3$ and $A \in \mathbb{R}^{n \times n}$ is a symmetric tridiagonal matrix.
We first show how to construct an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ and
an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ such that $A = QR$; the problem
is similar to the LU factorisation used in solving systems of linear
equations, but here we have an orthogonal matrix $Q$ instead of a lower
triangular matrix $L$.

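We construct the matrix $Q$ as a product of plane rotation matrices
$R^{(p\,p+1)}(\varphi) \in \mathbb{R}^{n \times n}$ (see Definition 5.2), with a suitably chosen $\varphi$. In
order to explain what is meant here by 'suitably chosen', we note that
in the product

$$ B = R^{(p\,p+1)}(\varphi)\,A \tag{5.33} $$

the element $b_{p+1\,p}$ is easily found to be

$$ b_{p+1\,p} = -s\,a_{pp} + c\,a_{p+1\,p}, $$

where $s = \sin\varphi$ and $c = \cos\varphi$. We can make $b_{p+1\,p} = 0$ by choosing

$$ c = \frac{a_{pp}}{(a_{pp}^2 + a_{p+1\,p}^2)^{1/2}}, \qquad s = \frac{a_{p+1\,p}}{(a_{pp}^2 + a_{p+1\,p}^2)^{1/2}}. \tag{5.34} $$

We note in passing that

$$
\begin{aligned}
b_{pp} &= c\,a_{pp} + s\,a_{p+1\,p}, \\
b_{p\,p+1} &= c\,a_{p\,p+1} + s\,a_{p+1\,p+1}, \\
b_{p+1\,p+1} &= -s\,a_{p\,p+1} + c\,a_{p+1\,p+1}.
\end{aligned}
$$

Only rows $p$ and $p+1$ are affected by the multiplication: $b_{ij} = a_{ij}$
whenever $i \notin \{p, p+1\}$.

To summarise the important points, upon multiplying the symmetric
tridiagonal matrix $A$ on the left by $R^{(p\,p+1)}(\varphi)$, where $c = \cos\varphi$ and
$s = \sin\varphi$ in $R^{(p\,p+1)}(\varphi)$ are chosen as indicated in (5.34), we obtain a
matrix $B = (b_{ij}) \in \mathbb{R}^{n \times n}$ such that $b_{p+1\,p} = 0$.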

After this brief preparation, we embark on the description of the QR
factorisation. Let us suppose that we successively multiply $A$ on the left
by the $n - 1$ plane rotation matrices,

$$ Q_1 = R^{(1\,2)}(\varphi_1), \quad Q_2 = R^{(2\,3)}(\varphi_2), \quad \ldots, \quad Q_{n-1} = R^{(n-1\,n)}(\varphi_{n-1}), $$

with $\varphi_1, \varphi_2, \ldots, \varphi_{n-1}$ selected according to (5.34); more precisely,
for $p = 1, 2, \ldots, n-1$, $\varphi_p$ is chosen so as to set the $(p+1, p)$-entry of
$Q_p \cdots Q_1 A$ to zero.
Given that the elements below the diagonal of the matrix

$$ Q_{p-1} \cdots Q_1 A, \qquad 2 \le p \le n-1, $$

which are already equal to zero, remain zero upon multiplication by the
next rotation matrix $Q_p$ in the sequence, we deduce that, after successive
multiplications of $A$ on the left by $Q_1, Q_2, \ldots, Q_{n-1}$, the matrix

$$ Q_{n-1} Q_{n-2} \cdots Q_1 A = R \tag{5.35} $$

is upper triangular. In fact, since $A$ is tridiagonal and each rotation
combines two adjacent rows, the only nonzero elements of $R$ lie on the
diagonal and on the two diagonals immediately above it; that is,
$R_{ij} = 0$ unless $j \in \{i, i+1, i+2\}$.

As the matrices $Q_p = R^{(p\,p+1)}(\varphi_p)$, $p = 1, 2, \ldots, n-1$, are orthogonal,
and therefore $Q_p^T Q_p = I$, on multiplying (5.35) on the left by
$Q_1^T Q_2^T \cdots Q_{n-1}^T$, we find that

$$ A = QR, $$

where

$$ Q = Q_1^T Q_2^T \cdots Q_{n-1}^T $$

is an orthogonal matrix (as it is a product of orthogonal matrices). The
next subsection describes the QR algorithm, based on the QR factorisation,
for the numerical solution of the eigenvalue problem (5.1) where
the matrix $A \in \mathbb{R}^{n \times n}$ is symmetric and tridiagonal.
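
The construction of $Q$ and $R$ by $n - 1$ plane rotations can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the names `qr_by_rotations` and `matmul` are hypothetical, matrices are plain lists of rows, and the test matrix is the symmetric tridiagonal matrix of Example 5.7 as reconstructed from its Sturm recurrence.

```python
import math


def qr_by_rotations(a):
    """Factorise a square matrix a (list of rows), whose entries below the
    first subdiagonal are zero, as a = q r using n - 1 plane rotations."""
    n = len(a)
    r = [row[:] for row in a]
    # qt accumulates the product Q_{n-1} ... Q_1, so that qt * a = r.
    qt = [[float(i == j) for j in range(n)] for i in range(n)]
    for p in range(n - 1):
        h = math.hypot(r[p][p], r[p + 1][p])
        if h == 0.0:
            continue               # entry (p+1, p) is already zero
        c, s = r[p][p] / h, r[p + 1][p] / h   # the choice (5.34)
        for m in (r, qt):          # apply the rotation to rows p and p+1
            for j in range(n):
                x, y = m[p][j], m[p + 1][j]
                m[p][j] = c * x + s * y
                m[p + 1][j] = -s * x + c * y
    q = [list(col) for col in zip(*qt)]       # Q = (Q_{n-1} ... Q_1)^T
    return q, r


def matmul(x, y):
    """Plain matrix product, adequate for small examples."""
    return [[sum(row[k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for row in x]


A = [[3.0, 1.0, 0.0, 0.0],
     [1.0, -1.0, 2.0, 0.0],
     [0.0, 2.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
Q, R = qr_by_rotations(A)
```

Multiplying `Q` and `R` with `matmul` recovers `A` to rounding error, and the entries of `R` below the diagonal are zero to the same accuracy.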


5.7.2 The definition of the QR algorithm 


Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric and tridiagonal. The QR algorithm
defines a sequence of symmetric tridiagonal matrices $A^{(k)} \in \mathbb{R}^{n \times n}$,
$k = 0, 1, 2, \ldots$, starting with $A^{(0)} = A$, as follows.

Suppose that $k \ge 0$. The $k$th step of the QR algorithm takes the
symmetric tridiagonal matrix $A^{(k)}$ and chooses a shift $\mu_k \in \mathbb{R}$ (the
choice of $\mu_k$ will be discussed below), then forming the QR factorisation

$$ A^{(k)} - \mu_k I = Q^{(k)} R^{(k)}. $$

We then multiply $Q^{(k)}$ and $R^{(k)}$ in the reverse order, and construct the
new matrix $A^{(k+1)}$ defined by

$$ A^{(k+1)} = R^{(k)} Q^{(k)} + \mu_k I. $$

Recalling that the matrix $Q^{(k)}$ is orthogonal, it is a simple matter to see
that $A^{(k+1)} = Q^{(k)T} A^{(k)} Q^{(k)}$, so that $A^{(k+1)}$ and $A^{(k)}$ have the same
eigenvalues. As $A^{(0)} = A$, all matrices in the sequence $(A^{(k)})$ have the
same eigenvalues as $A$ itself. It is also easy to show that each of the
matrices $A^{(k)}$ is symmetric and tridiagonal. (See Exercise 7.)
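
The similarity relation between successive matrices takes only one line to verify. The short derivation below is a sketch that uses nothing beyond the orthogonality of $Q^{(k)}$ and the two defining relations of the $k$th step.

```latex
\begin{aligned}
R^{(k)} &= Q^{(k)T}\bigl(A^{(k)} - \mu_k I\bigr), \\
A^{(k+1)} &= R^{(k)} Q^{(k)} + \mu_k I
           = Q^{(k)T}\bigl(A^{(k)} - \mu_k I\bigr) Q^{(k)} + \mu_k I
           = Q^{(k)T} A^{(k)} Q^{(k)} .
\end{aligned}
```

The shift cancels because $Q^{(k)T}(\mu_k I)Q^{(k)} = \mu_k I$.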

The choice of the shift parameter $\mu_k$ is very important; if correctly
chosen the sequence of matrices $A^{(k)}$ converges very rapidly to a matrix
in which one of the off-diagonal elements is zero. If this element is in the
first or last row, we have thereby identified one of the eigenvalues; if it is
one of the intermediate elements, we can split the matrix into two separate
matrices of lower order. In either case we can repeat the iterative
process with smaller matrices, until all the eigenvalues are found.

The usual simple choice of the shift parameter in the $k$th step is

$$ \mu_k = a^{(k)}_{nn}, $$

the last diagonal element of the matrix $A^{(k)}$. In general, after a few
steps of the iteration the element at position $(n, n-1)$ will become
negligibly small. One of the eigenvalues of the resulting matrix is then
the last diagonal element, and we continue the process with the matrix
of order $n - 1$ obtained by removing the last row and column. There
are special circumstances where this choice of shift is unsatisfactory, and
other situations where another choice is more efficient, but we shall not
discuss the details any further. The proof of the convergence of this
method is long and technical; details will be found in the books cited in
the Notes at the end of the chapter.

The method does not determine the eigenvalues in any particular order,
so if we require only a small number of the largest eigenvalues, for
example, the Sturm sequence method is preferable. The usual recommendation
is that the QR algorithm should be used on a matrix of order
$n$ if more than about $\tfrac{1}{4}n$ of the eigenvalues are required.
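
The steps of the QR algorithm with the simple shift $\mu_k = a^{(k)}_{nn}$ and removal of the last row and column can be sketched as below. The names `qr_step` and `qr_eigenvalues`, the tolerance `tol` and the cap `max_steps` are assumptions made for this illustration; the sketch works with full lists of rows for clarity and makes no attempt to exploit the tridiagonal structure, nor to handle the special cases in which this shift fails.

```python
import math


def qr_step(a, mu):
    """One step of the shifted QR algorithm for a symmetric tridiagonal
    matrix a (list of rows): factorise a - mu*I = QR, return RQ + mu*I."""
    n = len(a)
    r = [[a[i][j] - (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
    rotations = []
    for p in range(n - 1):                  # left-multiply by Q_1, ..., Q_{n-1}
        h = math.hypot(r[p][p], r[p + 1][p])
        c, s = (1.0, 0.0) if h == 0.0 else (r[p][p] / h, r[p + 1][p] / h)
        rotations.append((c, s))
        for j in range(n):
            x, y = r[p][j], r[p + 1][j]
            r[p][j], r[p + 1][j] = c * x + s * y, -s * x + c * y
    for p, (c, s) in enumerate(rotations):  # right-multiply by Q_1^T, ..., Q_{n-1}^T
        for i in range(n):
            x, y = r[i][p], r[i][p + 1]
            r[i][p], r[i][p + 1] = c * x + s * y, -s * x + c * y
    for i in range(n):
        r[i][i] += mu                       # add the shift back
    return r


def qr_eigenvalues(a, tol=1e-12, max_steps=200):
    """Eigenvalues by repeated QR steps with shift a_nn and deflation."""
    a = [row[:] for row in a]
    found = []
    while len(a) > 1:
        n = len(a)
        steps = 0
        while abs(a[n - 1][n - 2]) > tol:
            if steps == max_steps:
                raise RuntimeError("no convergence; a different shift is needed")
            a = qr_step(a, a[n - 1][n - 1])
            steps += 1
        found.append(a[n - 1][n - 1])       # last diagonal entry is an eigenvalue
        a = [row[:n - 1] for row in a[:n - 1]]   # remove last row and column
    found.append(a[0][0])
    return found


# Symmetric tridiagonal matrix of Example 5.7 (entries as reconstructed there).
A = [[3.0, 1.0, 0.0, 0.0],
     [1.0, -1.0, 2.0, 0.0],
     [0.0, 2.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
eigenvalues = qr_eigenvalues(A)
```

The eigenvalues are returned in the order in which they are isolated, not in increasing or decreasing order, which illustrates the remark above.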


Example 5.9 We apply the QR algorithm to the tridiagonal matrix
(5.31).

After one step of the iteration the matrix $A^{(1)} = R^{(0)} Q^{(0)} + \mu_0 I$, with
$\mu_0 = a^{(0)}_{55} = a_{55}$, is

$$
A^{(1)} = \begin{pmatrix}
 7.034 & -2.271 & 0 & 0 & 0 \\
-2.271 &  2.707 & -0.744 & 0 & 0 \\
 0 & -0.744 & 5.804 & 3.202 & 0 \\
 0 & 0 & 3.202 & -0.464 & 1.419 \\
 0 & 0 & 0 & 1.419 & -2.082
\end{pmatrix}.
$$

In successive iterations $k = 1, 2, 3, 4, 5$, the element $a^{(k)}_{54}$ has the values
1.419, -1.262, 0.965, -0.223, 0.002, and after the next iteration $a^{(6)}_{54}$
vanishes to 10 decimal digits. The element $a^{(6)}_{55}$ is -3.282, which is
therefore an eigenvalue.

We then remove the last row and column, and continue the process
on the resulting $4 \times 4$ matrix. After just one iteration the element at
position (4, 3) vanishes to 7 decimal digits, giving the eigenvalue -0.671.
We remove the last row and column and continue with the resulting $3 \times 3$
matrix. After one iteration of the resulting $3 \times 3$ matrix the element
at position (3, 2) is 0.0005, and another iteration gives the accurate
eigenvalue 1.690. We are now left with a $2 \times 2$ matrix, and the calculation
of the last two eigenvalues is trivial. The number of iterations required
to isolate each eigenvalue reduces as the algorithm reduces the size of
the matrix; this sort of behaviour is typical.

The numerical values agree with those obtained by Jacobi's method,
and the bisection method. ◊


5.8 Inverse iteration for the eigenvectors 


We saw in Section 5.3 that Jacobi's method can also, if required, produce
the eigenvectors of the matrix, but the use of Householder's algorithm, in
conjunction with the Sturm sequence method or the QR algorithm, only
gives the eigenvalues. Suppose that $A \in \mathbb{R}^{n \times n}$ is a symmetric matrix,
and assume that we have a good approximation $\vartheta \in \mathbb{R}$ to the required
eigenvalue $\lambda \in \mathbb{R}$ of $A$, and some approximation $v^{(0)} \in \mathbb{R}^n$, $\|v^{(0)}\|_2 = 1$,
to the associated eigenvector $v \in \mathbb{R}^n$, $\|v\|_2 = 1$. It is implicitly assumed
that $\vartheta \ne \lambda$ and that $\vartheta$ is not an eigenvalue of $A$, so that the matrix
$A - \vartheta I$ is nonsingular. The method of inverse iteration defines the
sequence of vectors $v^{(k)}$, $k = 0, 1, \ldots$, as follows: given $v^{(k)} \in \mathbb{R}^n$, find
$w^{(k)} \in \mathbb{R}^n$ and then $v^{(k+1)} \in \mathbb{R}^n$ from

$$
\begin{aligned}
(A - \vartheta I)\,w^{(k)} &= v^{(k)}, \\
v^{(k+1)} &= c_k\,w^{(k)},
\end{aligned}
\tag{5.36}
$$

where $c_k = 1/\sqrt{w^{(k)T} w^{(k)}} = 1/\|w^{(k)}\|_2$. Hence, we conclude that
$\|v^{(k)}\|_2 = 1$, $k = 0, 1, 2, \ldots$.


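Theorem 5.10 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$. The sequence of vectors $(v^{(k)})$
in $\mathbb{R}^n$ defined in the process of inverse iteration (5.36) converges to the
normalised eigenvector $v \in \mathbb{R}^n$ corresponding to the eigenvalue $\lambda \in \mathbb{R}$
which is closest to $\vartheta \in \mathbb{R}$, provided that $\lambda$ is a simple eigenvalue and the
initial vector $v^{(0)} \in \mathbb{R}^n$ is not orthogonal to the vector $v$.

Proof According to Theorem 5.1 (vii), the vector $v^{(0)}$ can be expressed
as a linear combination of the (ortho)normalised eigenvectors $x^{(j)}$ in $\mathbb{R}^n$,
$j = 1, 2, \ldots, n$, of the matrix $A$ in the form

$$ v^{(0)} = \sum_{j=1}^{n} \alpha_j x^{(j)}, \qquad \alpha_j = v^{(0)T} x^{(j)}. \tag{5.37} $$

Let $\lambda_s \in \mathbb{R}$ denote the eigenvalue of $A$ which is closest to $\vartheta \in \mathbb{R}$. We shall
prove that the sequence $(v^{(k)})$ converges, as $k \to \infty$, to the eigenvector
$v = x^{(s)} \in \mathbb{R}^n$ associated with $\lambda_s$, provided that $\alpha_s = v^{(0)T} x^{(s)} \ne 0$.
On expanding

$$ w^{(0)} = \sum_{j=1}^{n} \beta_j x^{(j)}, $$

inserting this expansion into the first line of (5.36) with $k = 0$ and
comparing the resulting left-hand side with the expansion (5.37) of $v^{(0)}$
on the right, we find that $(\lambda_j - \vartheta)\beta_j = \alpha_j$. Our hypothesis that $\alpha_s \ne 0$
implies that $\lambda_s \ne \vartheta$. Further, as $\lambda_s$ is the eigenvalue closest to $\vartheta$, it then
follows that $\lambda_j - \vartheta \ne 0$ for all $j \in \{1, 2, \ldots, n\}$. Hence,

$$ v^{(1)} = c_0 w^{(0)} = c_0 \sum_{j=1}^{n} \frac{\alpha_j}{\lambda_j - \vartheta}\, x^{(j)}. $$

Repeating this argument for $k = 1, \ldots, m-1$ gives

$$ v^{(m)} = c_{m-1} \cdots c_0 \sum_{j=1}^{n} \frac{\alpha_j}{(\lambda_j - \vartheta)^m}\, x^{(j)}. \tag{5.38} $$

Now $v^{(m)T} v^{(m)} = 1$, and therefore,

$$ c_{m-1} \cdots c_0 = \Bigl[\, \sum_{j=1}^{n} \frac{\alpha_j^2}{(\lambda_j - \vartheta)^{2m}} \Bigr]^{-1/2}. \tag{5.39} $$

Substituting (5.39) into (5.38) and extracting the factor $(\lambda_s - \vartheta)^{-m}$ from
the sum, we obtain

$$
v^{(m)} = \sigma^m \Bigl[\, \alpha_s^2 + \sum_{j \ne s} \alpha_j^2 \Bigl( \frac{\lambda_s - \vartheta}{\lambda_j - \vartheta} \Bigr)^{2m} \Bigr]^{-1/2}
\Bigl[\, \alpha_s x^{(s)} + \sum_{j \ne s} \alpha_j \Bigl( \frac{\lambda_s - \vartheta}{\lambda_j - \vartheta} \Bigr)^{m} x^{(j)} \Bigr],
$$

where $\sigma = \pm 1$ is the sign of $\lambda_s - \vartheta$. Since

$$ \Bigl| \frac{\lambda_s - \vartheta}{\lambda_j - \vartheta} \Bigr| < 1 \qquad \forall\, j \in \{1, 2, \ldots, n\} \setminus \{s\}, $$

we find that, up to the sign factor $\sigma^m \alpha_s/|\alpha_s| = \pm 1$ (an eigenvector being
determined only up to its sign), $\lim_{m \to \infty} v^{(m)} = x^{(s)} = v$; that completes
the proof.

If the estimate $\vartheta$ is within rounding error of $\lambda$, and the eigenvalues
are well spaced, the convergence of the sequence $(v^{(k)})$ will be extremely
rapid: usually a couple of iterations will be sufficient.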
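
The iteration (5.36) can be sketched in a few lines. This is an illustration only: the names `solve` and `inverse_iteration` are hypothetical, the linear system is solved by a simple Gaussian elimination with partial pivoting written out for self-containment, and the matrix and the estimate $\vartheta = 2.45$ are taken from Example 5.7 as assumptions for the demonstration.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[piv] = m[piv], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def inverse_iteration(a, theta, v, steps=3):
    """Apply (5.36): solve (A - theta I) w = v, then normalise w."""
    n = len(v)
    shifted = [[a[i][j] - (theta if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    for _ in range(steps):
        w = solve(shifted, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                      # c_k = 1 / ||w||_2
    return v


A = [[3.0, 1.0, 0.0, 0.0],
     [1.0, -1.0, 2.0, 0.0],
     [0.0, 2.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
v = inverse_iteration(A, 2.45, [0.5, 0.5, 0.5, 0.5], steps=4)
```

In practice the factorisation of $A - \vartheta I$ would be computed once and reused at every step; the sketch repeats the elimination only to keep the code short.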

The proof of Theorem 5.10 breaks down if $\alpha_s = 0$, i.e., when the initial
vector $v^{(0)}$ is exactly orthogonal to the required eigenvector. However,
this does not mean that the iteration (5.36) will also break down; for
the effect of rounding error will almost always introduce a small multiple
of the vector $x^{(s)}$ into the expansion of $v^{(0)}$ in terms of the $x^{(j)}$ with
$j = 1, 2, \ldots, n$, and the required eigenvector will then be obtained in a
small number of iterations. This is a useful property of the method, since
in practice it is not possible to check whether or not $v^{(0)}$ is orthogonal
to $v$, given that the eigenvector $v$ is unknown.

There will also be a problem if there is a multiple eigenvalue, or two
eigenvalues are very close together: in the first case $|\lambda_s - \vartheta|/|\lambda_j - \vartheta| = 1$
for some $j \ne s$, and the proof of Theorem 5.10 breaks down; in the
second case $|\lambda_s - \vartheta|/|\lambda_j - \vartheta| \approx 1$ for some $j \ne s$, leading to very slow
convergence.

The computation of $w^{(k)}$ from (5.36) requires the solution of a system
of linear equations whose matrix is $A - \vartheta I$. This matrix will usually
be nearly singular; in fact, our objective in choosing $\vartheta$ was to make
$A - \vartheta I$ exactly singular. In general the solution of such a system is
extremely dangerous, because of the effect of rounding errors; in this
case, however, the effect of rounding error will be to introduce a multiple
of the dominant eigenvector, and this is exactly what is required. An
analysis of the effect of rounding errors will confirm this fact, but would
take too long here.¹

There are two ways in which we can implement the inverse iteration
process. One obvious possibility would be to use the original matrix
$A \in \mathbb{R}^{n \times n}$, as implied in (5.36). An alternative is to replace $A$ in this
equation by the tridiagonal matrix $T \in \mathbb{R}^{n \times n}$ supplied by Householder's
method. The calculation is then very much quicker, but produces the
eigenvector of $T$; to obtain the corresponding eigenvector of $A$ we must
then apply to this vector the sequence of Householder transformations
which were used in the original reduction to tridiagonal form. It is easy
to show that this is the most efficient method.

¹ For further details, we refer to Sec. 4.3 in B. Parlett, The Symmetric Eigenvalue
Problem, Prentice-Hall, Englewood Cliffs, NJ, 1980, and Section 7.6.1 in G.H.
Golub and C.F. Van Loan, Matrix Computations, Third Edition, Johns Hopkins
University Press, Baltimore, 1996.

Inverse iteration with the original matrix $A \in \mathbb{R}^{n \times n}$ requires the LU
decomposition of $A - \vartheta I$, followed by one or more forward and backsubstitution
operations. As we saw in Section 2.6, the LU decomposition
requires approximately $\tfrac{1}{3}n^3$ multiplications. The same process with the
tridiagonal matrix $T$, using the Thomas algorithm, involves only a small
multiple of $n$ multiplications.

Having found an eigenvector of the tridiagonal matrix $T \in \mathbb{R}^{n \times n}$, so
that

$$ Tv = \lambda v, $$

we use the fact that $Q^T A Q = T$ to write

$$ AQv = \lambda Qv, $$

so that the vector $Qv$ is an eigenvector of $A$. Using Theorem 5.7, this
means that the required eigenvector of $A$ is

$$ H_{(n,n-1)} \cdots H_{(n,2)}\, v, $$

where the matrices $H_{(n,j)} \in \mathbb{R}^{n \times n}$, $j = 2, \ldots, n-1$, are Householder
matrices. To multiply a vector $x$ by a Householder matrix $H = H(u)$
we write

$$ Hx = (I - \alpha\, u u^T)x = x - \alpha\,(u^T x)\,u. $$

Assuming that $\alpha = 2/(u^T u)$ is known, this requires the calculation of
the scalar product $u^T x$, and then subtracting a multiple of the vector
$u$ from the vector $x$. This evidently involves $2n$ multiplications. Hence
the calculation of $Qv$ requires only $2n(n-2)$ multiplications, and the
work involved in the whole process is proportional to $n^2$, instead of $n^3$.
In fact the total is less than $2n(n-2)$, since a more careful count can
use the fact that many of the elements in the vector $u$ are known to be
zero.
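
The economical product $Hx = x - \alpha(u^T x)u$ can be sketched as follows. The function name `apply_householder` and the sample vectors are assumptions made for illustration; the point is only that the matrix $H$ is never formed.

```python
def apply_householder(u, x):
    """Return H x for H = I - alpha u u^T with alpha = 2 / (u^T u),
    without forming the n-by-n matrix H."""
    alpha = 2.0 / sum(ui * ui for ui in u)
    scale = alpha * sum(ui * xi for ui, xi in zip(u, x))   # alpha (u^T x)
    return [xi - scale * ui for ui, xi in zip(u, x)]


u = [1.0, 2.0, 2.0]
x = [3.0, 0.0, 4.0]
hx = apply_householder(u, x)
```

Since $H$ is orthogonal and symmetric, the result has the same 2-norm as `x`, and applying `apply_householder(u, ...)` a second time returns the original vector.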


Example 5.10 Returning to the tridiagonal matrix (5.31), the QR algorithm
has given an accurate eigenvalue which is 8.094 to three decimal
digits. Beginning the inverse iteration (5.36) with a randomly chosen
vector $v^{(0)} \in \mathbb{R}^5$, we find that

$$ v^{(1)} = (-0.0249, -0.0574, -0.3164, 0.4256, 0.8455)^T. $$

Successive iterations make no change in this vector, as might be expected,
since the eigenvalue used was accurate to within rounding error.
This is therefore the eigenvector of the tridiagonal matrix (5.31), to
within rounding error. To obtain the eigenvector of the original matrix
(5.16) we multiply $v^{(1)}$ in succession by the three Householder matrices
defined by the vectors (5.30), (5.29) and (5.28). The result is the
eigenvector

$$ v = (-0.0249, -0.5952, -0.1920, -0.2885, 0.7246)^T. $$

Using this vector and the accurately calculated eigenvalue, we can check
the result, and find that the elements of $Av - \lambda v$ are of the same order
as rounding error.


5.9 The Rayleigh quotient 


In this section we develop a simple technique based on the concept of
Rayleigh quotient,¹ for obtaining an accurate approximation to an eigenvalue
of a symmetric matrix when a reasonably accurate approximation
to the associated eigenvector is already available.


Definition 5.7 Given a vector $x \in \mathbb{R}^n \setminus \{0\}$ and a matrix $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, the
associated Rayleigh quotient $R(x)$ is defined as the real number

$$ R(x) = \frac{x^T A x}{x^T x}. \tag{5.40} $$

Clearly, if $x \in \mathbb{R}^n \setminus \{0\}$ is an eigenvector corresponding to an eigenvalue
$\lambda \in \mathbb{R}$ of a matrix $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, then $R(x) = \lambda$. More generally, if $x$ is
any nonzero vector in $\mathbb{R}^n$, then a number of further properties of the
Rayleigh quotient are immediate deductions from the expansion of $x$ in
terms of the eigenvectors of $A$.


Theorem 5.11 Suppose that the matrix $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ has the eigenvalues
$\lambda_j \in \mathbb{R}$, $j = 1, 2, \ldots, n$, and the corresponding normalised eigenvectors
$x^{(j)} \in \mathbb{R}^n$, $j = 1, 2, \ldots, n$. If the vector $x$ is expressed in terms of the
eigenvectors $x^{(j)}$, $j = 1, 2, \ldots, n$, as

$$ x = \sum_{j=1}^{n} \alpha_j x^{(j)}, \tag{5.41} $$

then

$$ R(x) = \frac{\sum_{j=1}^{n} \lambda_j \alpha_j^2}{\sum_{j=1}^{n} \alpha_j^2}. \tag{5.42} $$

On noting that $x^{(i)T} x^{(j)}$ is equal to 1 when $i = j$ and to 0 otherwise,
(5.42) follows trivially by inserting (5.41) into (5.40).

¹ John William Strutt, Lord Rayleigh (12 November 1842, Langford Grove (near
Maldon), Essex, England; died 30 June 1919, Terling Place, Witham, Essex, England).
In 1879 Rayleigh wrote a paper on travelling waves which set the foundation for
the modern theory of solitons. His theory of scattering (1871) was the first correct
explanation of why the sky is blue: the intensity of light scattered from small
particles is inversely proportional to the fourth power of the wavelength; for this
reason, the intensity of the short-wavelength blue component dominates in the
scattered light reaching our eyes. From 1879 to 1884 Rayleigh was the second
Cavendish Professor of Physics at Cambridge, succeeding Maxwell, and he was
awarded the Nobel prize in 1904 for the discovery of the gas argon.


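Theorem 5.12 Let $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$. For any vector $x \in \mathbb{R}^n \setminus \{0\}$,

$$ \lambda_{\min} \le R(x) \le \lambda_{\max}, \tag{5.43} $$

where $\lambda_{\min} \in \mathbb{R}$ and $\lambda_{\max} \in \mathbb{R}$ are respectively the least and greatest of
the eigenvalues of $A$. These bounds are attained when $x$ is the corresponding
eigenvector.


Proof The inequalities follow immediately from (5.42) by noting that
$\lambda_{\min} \le \lambda_j \le \lambda_{\max}$, $j = 1, 2, \ldots, n$.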


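Theorem 5.13 Suppose that $x \in \mathbb{R}^n$ is a normalised vector, that is,
$\|x\|_2 = 1$. Assume, further, that $x^{(k)} \in \mathbb{R}^n$ is the $k$th normalised eigenvector
of $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, and that

$$ \|x - x^{(k)}\|_2 = O(\varepsilon) $$

for a small $\varepsilon \in \mathbb{R}$. Then,

$$ R(x) = \lambda_k + O(\varepsilon^2). $$


Proof It follows from (5.41) that $x^T x^{(k)} = \alpha_k$, and therefore,

$$
\begin{aligned}
\|x - x^{(k)}\|_2^2 &= (x - x^{(k)})^T (x - x^{(k)}) \\
 &= \|x\|_2^2 - 2 x^T x^{(k)} + \|x^{(k)}\|_2^2 \\
 &= 2(1 - \alpha_k).
\end{aligned}
$$

Hence, $\alpha_k = 1 + O(\varepsilon^2)$. Further,

$$
\begin{aligned}
1 = \|x\|_2^2 &= \sum_{j=1}^{n} \alpha_j^2 \\
 &= \alpha_k^2 + \sum_{j \ne k} \alpha_j^2 \\
 &= 1 + O(\varepsilon^2) + \sum_{j \ne k} \alpha_j^2.
\end{aligned}
$$

Consequently, $\alpha_j = O(\varepsilon)$ for all $j \ne k$. The result then follows from
(5.42) which (with $\sum_{j=1}^{n} \alpha_j^2 = \|x\|_2^2 = 1$) yields that

$$
\begin{aligned}
R(x) &= \lambda_k \alpha_k^2 + \sum_{j \ne k} \lambda_j \alpha_j^2 \\
 &= \lambda_k + O(\varepsilon^2).
\end{aligned}
$$


This important result means that if we have a fairly close approximation
$x$ to an eigenvector of $A$, then the Rayleigh quotient $R(x)$ gives
very easily a much more accurate approximation to the corresponding
eigenvalue.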
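
The quadratic gain in accuracy can be observed on a small example. The sketch below is illustrative only: the function name `rayleigh_quotient` and the $2 \times 2$ test matrix are assumptions, and the perturbed vector is not normalised, which is harmless because $R(x)$ is unchanged when $x$ is multiplied by a nonzero scalar.

```python
def rayleigh_quotient(a, x):
    """R(x) = x^T A x / x^T x for a symmetric matrix a given as a list of rows."""
    ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in a]
    return sum(xi * yi for xi, yi in zip(x, ax)) / sum(xi * xi for xi in x)


A = [[2.0, 1.0], [1.0, 2.0]]     # eigenvalue 3 with eigenvector (1, 1)
errors = {}
for eps in (1e-1, 1e-2, 1e-3):
    x = [1.0, 1.0 + eps]         # eigenvector perturbed by O(eps)
    errors[eps] = abs(rayleigh_quotient(A, x) - 3.0)
```

For this matrix a direct calculation gives $R(x) = 3 - \varepsilon^2/(2 + 2\varepsilon + \varepsilon^2)$, so reducing $\varepsilon$ by a factor of 10 reduces the error in the eigenvalue by a factor of about 100.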


5.10 Perturbation analysis 


It is often necessary to have an estimate of how much the eigenvalues
and eigenvectors of a matrix are affected by changes in the elements.
Such perturbations may arise, for example, when the matrix elements
are obtained by physical measurements which are inexact, or they might
result from finite difference approximations of a differential equation, as
will be seen in Chapter 13. The last two theorems in this chapter address
some of these questions. We begin with the following preliminary result.

Theorem 5.14 Let $M \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, with eigenvalues $\lambda_i$ and corresponding
orthonormal eigenvectors $v_i$, $i = 1, 2, \ldots, n$, and suppose that $u \ne 0$ and
$w$ are vectors in $\mathbb{R}^n$ and $\mu$ is a real number such that

$$ (M - \mu I)u = w. \tag{5.44} $$

Then, at least one eigenvalue $\lambda_j$ of $M$ satisfies

$$ |\lambda_j - \mu| \le \|w\|_2 / \|u\|_2. $$
Proof If $\mu$ is equal to one of the eigenvalues the proof is trivial, so we
shall assume that $\mu \ne \lambda_k$, $k = 1, 2, \ldots, n$. We write the vectors $u$ and
$w$ as linear combinations of the eigenvectors of $M$, so that

$$ u = \sum_{k=1}^{n} \alpha_k v_k, \qquad w = \sum_{k=1}^{n} \beta_k v_k. $$

Substituting in (5.44), we may equate coefficients of the linearly independent
vectors $v_k$, $k = 1, 2, \ldots, n$, to deduce that

$$ (\lambda_k - \mu)\alpha_k = \beta_k, \qquad k = 1, 2, \ldots, n. $$

Now suppose that $\lambda_j$ is the eigenvalue which is closest to $\mu$; this means
that

$$ |\lambda_j - \mu| \le |\lambda_k - \mu|, \qquad k = 1, 2, \ldots, n. $$

Since the eigenvectors $v_i$, $i = 1, 2, \ldots, n$, are orthonormal in $\mathbb{R}^n$, we have

$$ \sum_{k=1}^{n} \alpha_k^2 = \|u\|_2^2, \qquad \sum_{k=1}^{n} \beta_k^2 = \|w\|_2^2. $$

Hence

$$ \sum_{k=1}^{n} \frac{\beta_k^2}{(\lambda_k - \mu)^2} = \|u\|_2^2, $$

which gives

$$ \|w\|_2^2 = \sum_{k=1}^{n} \beta_k^2 \ge (\lambda_j - \mu)^2 \sum_{k=1}^{n} \frac{\beta_k^2}{(\lambda_k - \mu)^2} = (\lambda_j - \mu)^2 \|u\|_2^2, $$

as required.


We shall now use this result to show that in the case of a symmetric
matrix $A$, small symmetric perturbations of $A$ lead to small changes in
the eigenvalues of $A$.


Theorem 5.15 (Bauer-Fike Theorem (symmetric case)) Suppose
that $A, E \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and $B = A - E$. Assume, further, that the eigenvalues
of $A$ are denoted by $\lambda_j$, $j = 1, 2, \ldots, n$, and $\mu$ is an eigenvalue of $B$.
Then, at least one eigenvalue $\lambda_j$ of $A$ satisfies

$$ |\lambda_j - \mu| \le \|E\|_2. $$

Proof This is a straightforward consequence of the previous theorem.
Suppose that $u$ is the normalised eigenvector of $B$ corresponding to the
eigenvalue $\mu$, so that $Bu = \mu u$. Then,

$$ (A - \mu I)u = (B + E - \mu I)u = Eu. $$

It then follows from Theorem 5.14 that there is an eigenvalue $\lambda_j$ of $A$
such that

$$ |\lambda_j - \mu| \le \|Eu\|_2 \le \|E\|_2 \|u\|_2 = \|E\|_2, $$

as required.


Example 5.11 Consider the $3 \times 3$ Hilbert matrix

$$ A = \begin{pmatrix} 1 & 1/2 & 1/3 \\ 1/2 & 1/3 & 1/4 \\ 1/3 & 1/4 & 1/5 \end{pmatrix} $$

and its perturbation

$$ B = \begin{pmatrix} 1.0000 & 0.5000 & 0.3333 \\ 0.5000 & 0.3333 & 0.2500 \\ 0.3333 & 0.2500 & 0.2000 \end{pmatrix} $$

which results by rounding each entry of $A$ to four decimal digits.

In this case, $E = A - B$ and $\|E\|_2 = 3.3 \times 10^{-5}$. Let $\mu$ be an eigenvalue
of $B$; then, according to Theorem 5.15, at least one of the eigenvalues $\lambda_1$,
$\lambda_2$, $\lambda_3$ of the matrix $A$ satisfies the inequality

$$ |\lambda_j - \mu| \le 3.3 \times 10^{-5}. \tag{5.45} $$

Indeed, the true eigenvalues of $A$ and $B$ are, respectively,

$$ \lambda_1 = 0.002687338072, \quad \lambda_2 = 0.1223270673, \quad \lambda_3 = 1.408318925, $$

and

$$ \mu_1 = 0.002664493933, \quad \mu_2 = 0.1223414532, \quad \mu_3 = 1.408294053. $$

Therefore,

$$ \lambda_1 - \mu_1 = 2.28 \times 10^{-5}, \quad \lambda_2 - \mu_2 = -1.44 \times 10^{-5}, \quad \lambda_3 - \mu_3 = 2.49 \times 10^{-5}, $$

which is in agreement with (5.45). ◊
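
The check carried out in Example 5.11 can be reproduced numerically. The sketch below is one possible illustration, not the method used in the text: it computes the eigenvalues with a small cyclic Jacobi iteration (the function name `jacobi_eigenvalues` and the fixed number of sweeps are assumptions), and obtains $\|E\|_2$ as the largest absolute eigenvalue of the symmetric matrix $E$.

```python
import math


def jacobi_eigenvalues(a, sweeps=20):
    """Eigenvalues of a symmetric matrix (list of rows) by cyclic Jacobi
    rotations; returns them in increasing order."""
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle chosen so that the rotated (p, q) entry vanishes.
                phi = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(phi), math.sin(phi)
                for k in range(n):          # rotate rows p and q
                    x, y = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * x - s * y, s * x + c * y
                for k in range(n):          # rotate columns p and q
                    x, y = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * x - s * y, s * x + c * y
    return sorted(a[i][i] for i in range(n))


A = [[1.0 / (i + j + 1) for j in range(3)] for i in range(3)]   # Hilbert matrix
B = [[round(x, 4) for x in row] for row in A]                   # four decimal digits
E = [[A[i][j] - B[i][j] for j in range(3)] for i in range(3)]

lam = jacobi_eigenvalues(A)
mu = jacobi_eigenvalues(B)
norm_E = max(abs(e) for e in jacobi_eigenvalues(E))             # ||E||_2, E symmetric

# Bauer-Fike (symmetric case): every mu has some lambda within ||E||_2.
gaps = [min(abs(l - m) for l in lam) for m in mu]
bound_holds = all(g <= norm_E for g in gaps)
```

Here `bound_holds` is `True`, and `norm_E` equals $1/3 - 0.3333 \approx 3.33 \times 10^{-5}$, since the only entries changed by the rounding are those equal to $1/3$.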


5.11 Notes 


Theorem 5.15 is a special case of the following general result, known as
the Bauer-Fike Theorem.¹


¹ F.L. Bauer and C.T. Fike, Norms and exclusion theorems, Numer. Math. 2,
137-141, 1960.

Theorem 5.16 Assume that $A \in \mathbb{C}^{n \times n}$ is diagonalisable; i.e., there
exists a nonsingular matrix $X \in \mathbb{C}^{n \times n}$ such that $X^{-1} A X = \Lambda$, where
$\Lambda$ is a diagonal matrix whose diagonal entries $\lambda_j$, $j = 1, \ldots, n$, are the
eigenvalues of $A$. Suppose further that $E \in \mathbb{C}^{n \times n}$, $B = A - E$, and $\mu$ is
an eigenvalue of $B$. Then, at least one eigenvalue $\lambda_j$ of $A$ satisfies

$$ |\lambda_j - \mu| \le \kappa_2(X)\,\|E\|_2, $$

where $\kappa_2(X) = \|X\|_2\,\|X^{-1}\|_2$ is the condition number of the matrix $X$
in the matrix 2-norm $\|\cdot\|_2$ on $\mathbb{C}^{n \times n}$.


In the special case when $A, E \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, the matrix $X$ can be chosen
to be orthogonal; i.e., $X^{-1} = X^T$. Therefore, $\|X\|_2 = \|X^{-1}\|_2 = 1$, and
hence $\kappa_2(X) = 1$, in accordance with the inequality stated in Theorem
5.15. Theorems 5.15 and 5.16 estimate how far the eigenvalues of $A$ are
perturbed by changes in the elements of $A$. The question as to how large
the changes in the eigenvectors may be is more difficult; it is discussed
in detail in

» J.H. WILKINSON, The Algebraic Eigenvalue Problem, Clarendon Press,
Oxford University Press, New York, 1988.

Chapter 8 of Wilkinson's book outlines the convergence proof of the QR
iteration, while the convergence of Jacobi's method is covered in Chapter
5 of that book. For further details, see also Chapter 9 of

» B. PARLETT, The Symmetric Eigenvalue Problem, Prentice-Hall,
Englewood Cliffs, NJ, 1980.

Exercises 


5.1 Give a proof of Lemma 5.3. 


5.2 Use Householder matrices to transform the matrix

$$ A = \begin{pmatrix} 2 & 1 & 2 & 2 \\ 1 & -7 & 6 & 5 \\ 2 & 6 & 2 & -5 \\ 2 & 5 & -5 & 1 \end{pmatrix} $$

to tridiagonal form.


5.3 Use Sturm sequences to show that no eigenvalue of the matrix

$$ A = \begin{pmatrix} 3 & 1 & 0 & 0 \\ 1 & 2 & -2 & 0 \\ 0 & -2 & 4 & \alpha \\ 0 & 0 & \alpha & 1 \end{pmatrix} $$

lies in the interval $(0, 1)$ if $5\alpha^2 > 8$, and that exactly one eigenvalue
of $A$ lies in this interval if $5\alpha^2 < 8$.


5.4 Given any two nonzero vectors $x$ and $y$ in $\mathbb{R}^n$, construct a
Householder matrix $H$ such that $Hx$ is a scalar multiple of $y$;
note that if $Hx = cy$, then $c^2 = x^T x / y^T y$. Is the matrix
unique?


5.5 Suppose that the matrix $D \in \mathbb{R}^{n \times n}$ is diagonal with distinct
diagonal elements $d_{11}, \ldots, d_{nn}$. Let $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, with $|a_{ij}| \le 1$
for all $i, j \in \{1, 2, \ldots, n\}$, and assume that $\varepsilon \in \mathbb{R}$ is so small that
$\varepsilon^2$ can be neglected, and that the matrix $D + \varepsilon A$ has eigenvalue
$\lambda + \varepsilon\mu$ and corresponding eigenvector $e + \varepsilon u$. Show that $\lambda = d_{jj}$
for some $j \in \{1, 2, \ldots, n\}$ and that $\mu = a_{jj}$. Write down the
elements of $e$, and show that

$$ u_i = \frac{a_{ij}}{d_{jj} - d_{ii}}, \qquad i \ne j. $$

Explain why the requirement that eigenvectors should be normalised
implies that $u_j = 0$.


5.6 With the same notation as in Exercise 5, suppose now that
$d_{11} = d_{22} = \cdots = d_{kk}$; that $d_{kk}, d_{k+1,k+1}, \ldots, d_{nn}$ are distinct,
and that $\varepsilon^3$ can be neglected. Writing the matrices and the
eigenvector in partitioned form, so that

$$
\begin{pmatrix} d_{11} I_k + \varepsilon A_1 & \varepsilon A_2 \\ \varepsilon A_2^T & D_{n-k} + \varepsilon A_3 \end{pmatrix}
\begin{pmatrix} e + \varepsilon u + \varepsilon^2 x \\ f + \varepsilon v + \varepsilon^2 y \end{pmatrix}
= (\lambda + \varepsilon\mu + \varepsilon^2\nu)
\begin{pmatrix} e + \varepsilon u + \varepsilon^2 x \\ f + \varepsilon v + \varepsilon^2 y \end{pmatrix},
$$

show that $\lambda = d_{11}$, $f = 0$, and that $\mu$ is an eigenvalue of $A_1$,
with corresponding eigenvector $e$. Show how $v$ is obtained from
the solution of $(D_{n-k} - \lambda I)v = -A_2^T e$, and that

$$ (A_1 - \mu I)u = \nu e - A_2 v. $$

Explain how the vector $u$ can be obtained in terms of the eigenvectors
and eigenvalues of the matrix $A_1$, assuming that these
eigenvalues are distinct.

Suppose that A € Ruy’ is tridiagonal, that A — wl = QR and 
B=RQ+ wl, where » € R, Q € R”*” is a product of plane ro- 
tations and R € R”*” is upper triangular and tridiagonal. Show 
that B can be written as an orthogonal transformation of A, and 
that B is symmetric. Show also that the only nonzero elements 
in the matrix B which are below the diagonal lie immediately 
below the diagonal; deduce that B is tridiagonal. 

Perform one step of the QR algorithm, using the shift uw = ann, 


for the matrix 
0 1 
Te ( a ) | 


Show that the QR algorithm does not converge for this matrix. 
(This is a special case in which a different shift must be used.) 
Perform one step of the QR algorithm, using the shift uw = ann, 


for the matrix 
13 4 
a, ( 4 10 ) } 


Carry out two steps of inverse iteration for the matrix 


219 
fala a) 


using the eigenvalue estimate 7 = 5 and the initial vector 


1 
(0) — 
v Ga 


Verify that the elements of the vector v?) agree with those of 
the true eigenvector with an accuracy of about 5%. Evaluate 
the Rayleigh quotient using the vector v'), and verify that the 
result agrees with the true eigenvalue to about 1 in 3000. 

5.11 An eigenvalue and eigenvector of the matrix $A$ may be evaluated
by solving the system of nonlinear equations

\[ (A - \lambda I)x = 0, \qquad x^{\mathrm T} x = 1 \]

for the unknowns $\lambda$ and $x$. Using Newton's method, starting
from estimates $\lambda^{(0)}$ and $x^{(0)}$, show that the next iteration is
determined by

\[ (A - \lambda^{(0)} I)\,\delta x - \delta\lambda\, x^{(0)} = -(A - \lambda^{(0)} I)\,x^{(0)}, \]

\[ -x^{(0)\mathrm T}\,\delta x = \tfrac{1}{2}\bigl(x^{(0)\mathrm T} x^{(0)} - 1\bigr), \]

and $x^{(1)} = x^{(0)} + \delta x$, $\lambda^{(1)} = \lambda^{(0)} + \delta\lambda$. Comment on the
difference between this method and the method of inverse iteration
in Section 5.8.

5.12 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and that Jacobi's method has produced
an orthogonal matrix $R$ and a symmetric matrix $B$ such that
$B = R^{\mathrm T} A R$. Suppose also that $|b_{ij}| < \varepsilon$ for all $i \ne j$. Show
that, for each $j = 1, 2, \ldots, n$, there is at least one eigenvalue $\lambda$
of $A$ such that

\[ |\lambda - b_{jj}| < \varepsilon \sqrt{n}. \]
5.13 Suppose that $A \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$ and that the Householder reduction
and QR algorithm have produced an orthogonal matrix $Q$ and
a tridiagonal matrix $T$ such that $T = Q^{\mathrm T} A Q$. Suppose also that
$|t_{n,n-1}| < \varepsilon$. Show that there is at least one eigenvalue $\lambda$ of $A$
such that

\[ |\lambda - t_{nn}| < \varepsilon. \]


Polynomial interpolation


6.1 Introduction 


It is time to take a break from solving equations. In this chapter we
consider the problem of polynomial interpolation; it involves finding a
polynomial that agrees exactly with some information that we have about a
real-valued function $f$ of a single real variable $x$. This information may
be in the form of values $f(x_0), \ldots, f(x_n)$ of the function $f$ at some finite
set of points $\{x_0, \ldots, x_n\}$ on the real line, and the corresponding
polynomial is then called the Lagrange interpolation polynomial¹ or,
provided that $f$ is differentiable, it may include values of the derivative
of $f$ at these points, in which case the associated polynomial is referred
to as a Hermite interpolation polynomial.²

Why should we be interested in constructing Lagrange or Hermite 
interpolation polynomials? If the function values f(x) are known for 
all x in a closed interval of the real line, then the aim of polynomial 


1 Joseph-Louis Lagrange (25 January 1736, Turin, Sardinia-Piedmont (now in 
Italy) — 10 April 1813, Paris, France) made fundamental contributions to the cal- 
culus of variations. He succeeded Euler as Director of Mathematics at the Berlin 
Academy of Sciences in 1766. During his stay in Berlin Lagrange worked on as- 
tronomy, the stability of the solar system, mechanics, dynamics, fluid mechanics, 
probability, number theory, and the foundations of calculus. In 1787 he moved 
to Paris and became a member of the Académie des Sciences. Napoleon named 
Lagrange to the Legion of Honour and as a Count of the Empire in 1808, and on 
3 April 1813, a week before his death, he received the Grand Croix of the Ordre 
Impérial de la Réunion. 

2 Charles Hermite (24 December 1822, Dieuze, Lorraine, France — 14 January 1901, 
Paris, France). Hermite did not enjoy formal examinations and had to spend 
five years to complete his undergraduate degree. He contributed to the theory 
of elliptic functions and their application to the general polynomial equation of 
the fifth degree. In 1873 he published the first proof that e is a transcendental 
number. Using methods similar to those of Hermite, Lindemann established in 
1882 that $\pi$ was also transcendental. A number of mathematical entities bear
Hermite’s name: Hermite orthogonal polynomials, Hermite’s differential equation, 
Hermite’s formula of interpolation and Hermitian matrices. 




interpolation is to approximate the function f by a polynomial over this 
interval. Given that any polynomial can be completely specified by its 
(finitely many) coefficients, storing the interpolation polynomial for f in 
a computer will be, generally, more economical than storing f itself. 
Frequently, it is the case, though, that the function values $f(x)$ are
only known at a finite set of points $x_0, \ldots, x_n$, perhaps as the results
of some measurements. The aim of polynomial interpolation is then to
attempt to reconstruct the unknown function $f$ by seeking a polynomial
$p_n$ whose graph in the $(x, y)$-plane passes through the points with
coordinates $(x_i, f(x_i))$, $i = 0, \ldots, n$. Of course, in general, the resulting
polynomial $p_n$ will differ from $f$ (unless $f$ itself is a polynomial of the
same degree as $p_n$), so an error will be incurred. In this chapter we shall
also establish results which provide bounds on the size of this error.


6.2 Lagrange interpolation 


Given that $n$ is a nonnegative integer, let $\mathcal{P}_n$ denote the set of all
(real-valued) polynomials of degree $\le n$ defined over the set $\mathbb{R}$ of real
numbers. The simplest interpolation problem can be stated as follows:
given $x_0$ and $y_0$ in $\mathbb{R}$, find a polynomial $p_0 \in \mathcal{P}_0$ such that
$p_0(x_0) = y_0$. The solution to this is, trivially, $p_0(x) \equiv y_0$. The purpose
of this section is to explore the following more general problem.


Let $n \ge 1$, and suppose that $x_i$, $i = 0, 1, \ldots, n$, are distinct real
numbers (i.e., $x_i \ne x_j$ for $i \ne j$) and $y_i$, $i = 0, 1, \ldots, n$, are real numbers;
we wish to find $p_n \in \mathcal{P}_n$ such that $p_n(x_i) = y_i$, $i = 0, 1, \ldots, n$.


To prove that this problem has a unique solution, we begin with a useful 
lemma. 

Lemma 6.1 Suppose that $n \ge 1$. There exist polynomials $L_k \in \mathcal{P}_n$,
$k = 0, 1, \ldots, n$, such that

\[
L_k(x_i) = \begin{cases} 1, & i = k, \\ 0, & i \ne k, \end{cases} \tag{6.1}
\]

for all $i, k = 0, 1, \ldots, n$. Moreover,

\[ p_n(x) = \sum_{k=0}^{n} L_k(x)\, y_k \tag{6.2} \]

satisfies the above interpolation conditions; in other words, $p_n \in \mathcal{P}_n$
and $p_n(x_i) = y_i$, $i = 0, 1, \ldots, n$.

Proof For each fixed $k$, $0 \le k \le n$, $L_k$ is required to have $n$ zeros: $x_i$,
$i = 0, 1, \ldots, n$, $i \ne k$; thus, $L_k(x)$ is of the form

\[ L_k(x) = C_k \prod_{i=0,\, i \ne k}^{n} (x - x_i), \tag{6.3} \]

where $C_k \in \mathbb{R}$ is a constant to be determined. It is easy to find the value
of $C_k$ by recalling that $L_k(x_k) = 1$; using this in (6.3) yields

\[ C_k = \prod_{i=0,\, i \ne k}^{n} \frac{1}{x_k - x_i}. \]

Inserting this expression for $C_k$ into (6.3) gives

\[ L_k(x) = \prod_{i=0,\, i \ne k}^{n} \frac{x - x_i}{x_k - x_i}. \tag{6.4} \]

As the function $p_n$ defined by (6.2) is a linear combination of the
polynomials $L_k \in \mathcal{P}_n$, $k = 0, 1, \ldots, n$, also $p_n \in \mathcal{P}_n$. Finally, $p_n(x_i) = y_i$ for
$i = 0, 1, \ldots, n$ is a trivial consequence of using (6.1) in (6.2).


Remark 6.1 Although the statement of Lemma 6.1 required that $n \ge 1$,
the trivial case of $n = 0$ mentioned at the beginning of the section can
also be included by defining, for $n = 0$, $L_0(x) \equiv 1$, and observing that
the function $p_0$ defined by

\[ p_0(x) = L_0(x)\, y_0 \quad (\equiv y_0) \]

is the unique polynomial in $\mathcal{P}_0$ that satisfies $p_0(x_0) = y_0$.


We note that, implicitly, the polynomials $L_k$, $k = 0, 1, \ldots, n$, depend
on the polynomial degree $n$, $n \ge 0$. To highlight this fact, a more
accurate but cumbersome notation would have involved writing, for example,
$L_k^n(x)$ instead of $L_k(x)$; this would have made it clear that $L_k^n(x)$ differs
from $L_k^m(x)$ when the polynomial degrees $n$ and $m$ differ. For the sake of
notational simplicity, we have chosen to write $L_k(x)$; the implied value
of $n$ will always be clear from the context.


Theorem 6.1 (Lagrange's Interpolation Theorem) Assume that
$n \ge 0$. Let $x_i$, $i = 0, \ldots, n$, be distinct real numbers and suppose that
$y_i$, $i = 0, \ldots, n$, are real numbers. Then, there exists a unique polynomial
$p_n \in \mathcal{P}_n$ such that

\[ p_n(x_i) = y_i, \qquad i = 0, 1, \ldots, n. \tag{6.5} \]

Proof In view of Remark 6.1, for $n = 0$ the proof is trivial. Let us
therefore suppose that $n \ge 1$. It follows immediately from Lemma 6.1
that the polynomial $p_n \in \mathcal{P}_n$ defined by

\[ p_n(x) = \sum_{k=0}^{n} L_k(x)\, y_k \]

satisfies the conditions (6.5), thus showing the existence of the required
polynomial. It remains to show that $p_n$ is the unique polynomial in $\mathcal{P}_n$
satisfying the interpolation property

\[ p_n(x_i) = y_i, \qquad i = 0, 1, \ldots, n. \]

Suppose, otherwise, that there exists $q_n \in \mathcal{P}_n$, different from $p_n$, such
that $q_n(x_i) = y_i$, $i = 0, 1, \ldots, n$. Then, $p_n - q_n \in \mathcal{P}_n$ and $p_n - q_n$
has $n + 1$ distinct roots, $x_i$, $i = 0, 1, \ldots, n$; since a polynomial of degree
$n$ cannot have more than $n$ distinct roots, unless it is identically 0, it
follows that

\[ p_n(x) - q_n(x) \equiv 0, \]

which contradicts our assumption that $p_n$ and $q_n$ are distinct. Hence,
there exists only one polynomial $p_n \in \mathcal{P}_n$ which satisfies (6.5).


Definition 6.1 Suppose that $n \ge 0$. Let $x_i$, $i = 0, \ldots, n$, be distinct real
numbers, and $y_i$, $i = 0, \ldots, n$, real numbers. The polynomial $p_n$ defined
by

\[ p_n(x) = \sum_{k=0}^{n} L_k(x)\, y_k, \tag{6.6} \]

with $L_k(x)$, $k = 0, 1, \ldots, n$, defined by (6.4) when $n \ge 1$, and $L_0(x) \equiv 1$
when $n = 0$, is called the Lagrange interpolation polynomial of
degree $n$ for the set of points $\{(x_i, y_i): i = 0, \ldots, n\}$. The numbers $x_i$,
$i = 0, \ldots, n$, are called the interpolation points.
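
As a small illustration of (6.4) and (6.6), not taken from the text, consider the case $n = 1$ with two distinct points $x_0$ and $x_1$. Writing out the two basis polynomials gives the familiar straight line through $(x_0, y_0)$ and $(x_1, y_1)$:

```latex
\[
L_0(x) = \frac{x - x_1}{x_0 - x_1}, \qquad
L_1(x) = \frac{x - x_0}{x_1 - x_0}, \qquad
p_1(x) = \frac{x - x_1}{x_0 - x_1}\, y_0 + \frac{x - x_0}{x_1 - x_0}\, y_1 .
\]
```

Setting $x = x_0$ or $x = x_1$ recovers $y_0$ or $y_1$ respectively, in agreement with (6.1).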


Frequently, the real numbers $y_i$ are given as the values of a real-valued
function $f$, defined on a closed real interval $[a, b]$, at the (distinct)
interpolation points $x_i \in [a, b]$, $i = 0, \ldots, n$.


Definition 6.2 Let $n \ge 0$. Given the real-valued function $f$, defined and
continuous on a closed real interval $[a, b]$, and the (distinct) interpolation
points $x_i \in [a, b]$, $i = 0, \ldots, n$, the polynomial $p_n$ defined by

\[ p_n(x) = \sum_{k=0}^{n} L_k(x)\, f(x_k) \tag{6.7} \]

is the Lagrange interpolation polynomial of degree $n$ (with
interpolation points $x_i$, $i = 0, \ldots, n$) for the function $f$.


Example 6.1 We shall construct the Lagrange interpolation polynomial
of degree 2 for the function $f: x \mapsto e^x$ on the interval $[-1, 1]$, with
interpolation points $x_0 = -1$, $x_1 = 0$, $x_2 = 1$.


As $n = 2$, we have that

\[ L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} = \tfrac{1}{2}\, x(x - 1). \]

Similarly, $L_1(x) = 1 - x^2$ and $L_2(x) = \tfrac{1}{2}\, x(x + 1)$. Therefore,

\[ p_2(x) = \tfrac{1}{2}\, x(x - 1)\, e^{-1} + (1 - x^2)\, e^{0} + \tfrac{1}{2}\, x(x + 1)\, e^{1}. \]

Thus, after some simplification, $p_2(x) = 1 + x \sinh 1 + x^2(\cosh 1 - 1)$. □
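
The construction in Example 6.1 can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names `lagrange_basis`, `lagrange_interpolate` and `p2_closed_form` are illustrative choices, and the code simply evaluates (6.4) and (6.6) directly and compares the result with the closed form obtained above.

```python
import math

def lagrange_basis(xs, k, x):
    """Evaluate L_k(x) from (6.4) for the interpolation points xs."""
    value = 1.0
    for i, xi in enumerate(xs):
        if i != k:
            value *= (x - xi) / (xs[k] - xi)
    return value

def lagrange_interpolate(xs, ys, x):
    """Evaluate p_n(x) = sum_k L_k(x) y_k, as in (6.6)."""
    return sum(lagrange_basis(xs, k, x) * yk for k, yk in enumerate(ys))

def p2_closed_form(x):
    """Closed form of p_2 derived in Example 6.1."""
    return 1.0 + x * math.sinh(1.0) + x * x * (math.cosh(1.0) - 1.0)

# Data of Example 6.1: f(x) = exp(x) sampled at -1, 0, 1.
xs = [-1.0, 0.0, 1.0]
ys = [math.exp(x) for x in xs]

for x in (-0.5, 0.25, 0.9):
    print(x, lagrange_interpolate(xs, ys, x), p2_closed_form(x))
```

The two printed columns should agree up to rounding error, and evaluating `lagrange_interpolate` at the interpolation points themselves should return the sampled values of $e^x$.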


Although the values of the function $f$ and those of its Lagrange
interpolation polynomial coincide at the interpolation points, $f(x)$ may be
quite different from $p_n(x)$ when $x$ is not an interpolation point. Thus, it
is natural to ask just how large the difference $f(x) - p_n(x)$ is when
$x \ne x_i$, $i = 0, \ldots, n$. Assuming that the function $f$ is sufficiently
smooth, an estimate of the size of the interpolation error $f(x) - p_n(x)$
is given in the next theorem.


Theorem 6.2 Suppose that $n \ge 0$, and that $f$ is a real-valued function,
defined and continuous on the closed real interval $[a, b]$, such that the
derivative of $f$ of order $n + 1$ exists and is continuous on $[a, b]$. Then,
given that $x \in [a, b]$, there exists $\xi = \xi(x)$ in $(a, b)$ such that

\[ f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\, \pi_{n+1}(x), \tag{6.8} \]

where

\[ \pi_{n+1}(x) = (x - x_0) \cdots (x - x_n). \tag{6.9} \]

Moreover

\[ |f(x) - p_n(x)| \le \frac{M_{n+1}}{(n+1)!}\, |\pi_{n+1}(x)|, \tag{6.10} \]

where

\[ M_{n+1} = \max_{\zeta \in [a, b]} |f^{(n+1)}(\zeta)|. \]


Proof When $x = x_i$ for some $i$, $i = 0, 1, \ldots, n$, both sides of (6.8) are
zero, and the equality is trivially satisfied. Suppose then that $x \in [a, b]$
and $x \ne x_i$, $i = 0, 1, \ldots, n$. For such a value of $x$, let us consider the
auxiliary function $t \mapsto \varphi(t)$, defined on the interval $[a, b]$ by

\[ \varphi(t) = f(t) - p_n(t) - \frac{f(x) - p_n(x)}{\pi_{n+1}(x)}\, \pi_{n+1}(t). \tag{6.11} \]

Clearly $\varphi(x_i) = 0$, $i = 0, 1, \ldots, n$, and $\varphi(x) = 0$. Thus, $\varphi$ vanishes at
$n + 2$ points which are all distinct in $[a, b]$. Consequently, by Rolle's
Theorem, Theorem A.2, $\varphi'(t)$, the first derivative of $\varphi$ with respect to $t$,
vanishes at $n + 1$ points in $(a, b)$, one between each pair of consecutive
points at which $\varphi$ vanishes.

In particular, if $n = 0$, we then deduce the existence of $\xi = \xi(x)$
in the interval $(a, b)$ such that $\varphi'(\xi) = 0$. Since $p_0(x) \equiv f(x_0)$ and
$\pi_1(t) = t - x_0$, it follows from (6.11) that

\[ 0 = \varphi'(\xi) = f'(\xi) - \frac{f(x) - p_0(x)}{\pi_1(x)}, \]

and hence (6.8) in the case of $n = 0$.

Now suppose that $n \ge 1$. As $\varphi'(t)$ vanishes at $n + 1$ points in $(a, b)$, one
between each pair of consecutive points at which $\varphi$ vanishes, applying
Rolle's Theorem again, we see that $\varphi''$ vanishes at $n$ distinct points. Our
assumptions about $f$ are sufficient to apply Rolle's Theorem $n + 1$ times
in succession, showing that $\varphi^{(n+1)}$ vanishes at some point $\xi \in (a, b)$, the
exact value of $\xi$ being dependent on the value of $x$. By differentiating
$n + 1$ times the function $\varphi$ with respect to $t$, and noting that $p_n$ is a
polynomial of degree $n$ or less, it follows that

\[ 0 = \varphi^{(n+1)}(\xi) = f^{(n+1)}(\xi) - \frac{f(x) - p_n(x)}{\pi_{n+1}(x)}\, (n+1)!. \]

Hence

\[ f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\, \pi_{n+1}(x). \]

In order to prove (6.10), we note that as $f^{(n+1)}$ is a continuous
function on $[a, b]$ the same is true of $|f^{(n+1)}|$. Therefore, the function
$x \mapsto |f^{(n+1)}(x)|$ is bounded on $[a, b]$ and achieves its maximum there; so
(6.10) follows from (6.8).

It is perhaps worth noting that since the location of $\xi$ in the interval
$[a, b]$ is unknown (to the extent that the exact dependence of $\xi$ on $x$
is not revealed by the proof of Theorem 6.2), (6.8) is of little practical
value; on the other hand, given the function $f$, an upper bound on the
maximum value of $|f^{(n+1)}|$ over $[a, b]$ is, at least in principle, possible to
obtain, and thereby we can provide an upper bound on the size of the
interpolation error by means of inequality (6.10).
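
To see how inequality (6.10) is used in practice, the sketch below (an illustration only, with hypothetical helper names `p2`, `bound` and `error`) revisits the function $f(x) = e^x$ of Example 6.1 on $[-1, 1]$ with $n = 2$. Here $f^{(3)}(x) = e^x$, so $M_3 = e$, and $\pi_3(x) = (x + 1)x(x - 1)$; the code compares the actual error with the bound on an assumed grid of 401 sample points.

```python
import math

def p2(x):
    """Interpolant of exp(x) at -1, 0, 1 (closed form from Example 6.1)."""
    return 1.0 + x * math.sinh(1.0) + x * x * (math.cosh(1.0) - 1.0)

def error(x):
    """Actual interpolation error |f(x) - p_2(x)| for f = exp."""
    return abs(math.exp(x) - p2(x))

def bound(x):
    """Right-hand side of (6.10) with n = 2, M_3 = e, pi_3(x) = (x+1)x(x-1)."""
    m3 = math.e
    return m3 / math.factorial(3) * abs((x + 1.0) * x * (x - 1.0))

# Assumed sampling grid on [-1, 1]; the spacing is an arbitrary choice.
grid = [-1.0 + 2.0 * i / 400 for i in range(401)]
worst_error = max(error(x) for x in grid)
worst_bound = max(bound(x) for x in grid)
print("largest sampled error:", worst_error)
print("largest sampled bound:", worst_bound)
```

On every grid point the error should lie below the bound, up to rounding; the bound is pessimistic because it replaces $f^{(3)}(\xi)$ by its maximum over the whole interval.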


6.3 Convergence 


An important theoretical question is whether or not a sequence $(p_n)$ of
interpolation polynomials for a continuous function $f$ converges to $f$ as
$n \to \infty$. This question needs to be made more specific, as $p_n$ depends
on the distribution of the interpolation points $x_j$, $j = 0, 1, \ldots, n$, not
just on the value of $n$. Suppose, for example, that we agree to choose
equally spaced points, with

\[ x_j = a + j(b - a)/n, \qquad j = 0, 1, \ldots, n, \quad n \ge 1. \]

The question of convergence then clearly depends on the behaviour of
$M_{n+1}$ as $n$ increases. In particular, if

\[ \lim_{n \to \infty} \frac{M_{n+1}}{(n+1)!} \max_{x \in [a, b]} |\pi_{n+1}(x)| = 0, \]

then, by (6.10),

\[ \lim_{n \to \infty} \max_{x \in [a, b]} |f(x) - p_n(x)| = 0, \tag{6.12} \]

and we say that the sequence of interpolation polynomials $(p_n)$, with
equally spaced points on $[a, b]$, converges to $f$ as $n \to \infty$, uniformly on
the interval $[a, b]$.
You may now think that if all derivatives of $f$ exist and are continuous
on $[a, b]$, then (6.12) will hold. Unfortunately, this is not so, since the
sequence

\[ \Bigl( M_{n+1} \max_{x \in [a, b]} |\pi_{n+1}(x)| \Bigr) \]

may tend to $\infty$, as $n \to \infty$, faster than the sequence $(1/(n + 1)!)$ tends
to 0.

In order to convince you of the existence of such 'pathological'
functions, we consider the sequence of Lagrange interpolation polynomials
$p_n$, $n = 0, 1, 2, \ldots$, with equally spaced interpolation points on the
interval $[-5, 5]$, to

\[ f(x) = \frac{1}{1 + x^2}, \qquad x \in [-5, 5]. \]

This example is due to Runge,¹ and the characteristic behaviour
exhibited by the sequence of interpolation polynomials $p_n$ in Table 6.1 is
referred to as the Runge phenomenon: Table 6.1 shows the maximum
difference between $f(x)$ and $p_n(x)$ for $-5 \le x \le 5$, for values of
$n$ from 2 up to 24. The numbers indicate clearly that the maximum
error increases exponentially as $n$ increases. Figure 6.1 shows the
interpolation polynomial $p_{10}$, using the equally spaced interpolation points
$x_j = -5 + j$, $j = 0, 1, \ldots, 10$. The sizes of the local maxima near $\pm 5$
grow exponentially as the degree $n$ increases.

Table 6.1. Runge phenomenon: $n$ denotes the degree of the
interpolation polynomial $p_n$ to $f$, with equally spaced points on $[-5, 5]$.
'Max error' signifies $\max_{x \in [-5, 5]} |f(x) - p_n(x)|$.

Degree n    Max error
    2          0.65
    4          0.44
    6          0.61
    8          1.04
   10          1.92
   12          3.66
   14          7.15
   16         14.25
   18         28.74
   20         58.59
   22        121.02
   24        252.78

1 Carle David Tolmé Runge (30 August 1856, Bremen, Germany – 3 January 1927,
Göttingen, Germany) studied mathematics and physics at the University of
Munich. His doctoral dissertation in 1880 was in the area of differential geometry.
Gradually, his research interests shifted to more applied topics: he devised a
numerical procedure for the solution of algebraic equations where the roots were
expressed as infinite series of rational functions of the coefficients, and in 1887
he started to work on the wavelengths of the spectral lines of elements. In 1904
Runge became Professor of Applied Mathematics in Göttingen. He was a fit and
active man: on his 70th birthday he entertained his grandchildren by performing
handstands. A few months later he suffered a fatal heart attack.

Fig. 6.1. Polynomial interpolation of $f: x \mapsto 1/(1 + x^2)$ for $x \in [-5, 5]$. The
continuous curve is $f$; the dotted curve is the associated Lagrange interpolation
polynomial $p_{10}$ of degree 10, using equally spaced interpolation points.

Note that, in many ways, the function $f$ is well behaved; all its
derivatives are continuous and bounded for all $x \in [-5, 5]$. The apparent
divergence of the sequence of Lagrange interpolation polynomials $(p_n)$ is
related to the fact that, when extended to the complex plane, the Taylor
series of the complex-valued function $f: z \mapsto 1/(1 + z^2)$ converges in the
open unit disc of radius 1 but not in any disc of larger radius centred
at $z = 0$, given that $f$ has poles on the imaginary axis at $z = \pm i$. Some
further insight into this problem is given in Exercise 11, and a similar
difficulty in numerical integration is discussed in Section 7.4.
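
The growth of the error reported in Table 6.1 can be explored with a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' computation: the names `runge`, `lagrange_eval` and `max_error` are hypothetical, the interpolant is evaluated directly from (6.4) and (6.6), and the maximum over $[-5, 5]$ is approximated by sampling on an assumed grid of 1001 equally spaced points, so the printed values are only approximations to the true maxima.

```python
def runge(x):
    """Runge's function f(x) = 1 / (1 + x^2)."""
    return 1.0 / (1.0 + x * x)

def lagrange_eval(nodes, values, x):
    """Evaluate the Lagrange interpolant at x using (6.4) and (6.6)."""
    total = 0.0
    for k, xk in enumerate(nodes):
        basis = 1.0
        for i, xi in enumerate(nodes):
            if i != k:
                basis *= (x - xi) / (xk - xi)
        total += values[k] * basis
    return total

def max_error(n, samples=1001):
    """Approximate max |f - p_n| on [-5, 5] with n + 1 equally spaced points."""
    nodes = [-5.0 + 10.0 * j / n for j in range(n + 1)]
    values = [runge(xj) for xj in nodes]
    worst = 0.0
    for s in range(samples):
        x = -5.0 + 10.0 * s / (samples - 1)
        worst = max(worst, abs(runge(x) - lagrange_eval(nodes, values, x)))
    return worst

for n in range(2, 25, 2):
    print(f"{n:3d}  {max_error(n):10.2f}")
```

The printed column should be close to the 'Max error' column of Table 6.1; small discrepancies are to be expected because the maximum is only sampled, and because direct evaluation of (6.4) accumulates rounding error at the higher degrees.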


6.4 Hermite interpolation 


The idea of Lagrange interpolation can be generalised in various ways;
we shall consider here one simple extension where a polynomial $p$ is
required to take given values and derivative values at the interpolation
points. Given the distinct interpolation points $x_i$, $i = 0, \ldots, n$, and two
sets of real numbers $y_i$, $i = 0, \ldots, n$, and $z_i$, $i = 0, \ldots, n$, with $n \ge 0$, we
need to find a polynomial $p_{2n+1} \in \mathcal{P}_{2n+1}$ such that

\[ p_{2n+1}(x_i) = y_i, \qquad p'_{2n+1}(x_i) = z_i, \qquad i = 0, \ldots, n. \]

The construction is similar to that of the Lagrange interpolation
polynomial, but now requires two sets of polynomials $H_k$ and $K_k$, with
$k = 0, \ldots, n$; these will be defined in the proof of the next theorem.


Theorem 6.3 (Hermite Interpolation Theorem) Let $n \ge 0$, and
suppose that $x_i$, $i = 0, \ldots, n$, are distinct real numbers. Then, given
two sets of real numbers $y_i$, $i = 0, \ldots, n$, and $z_i$, $i = 0, \ldots, n$, there is a
unique polynomial $p_{2n+1}$ in $\mathcal{P}_{2n+1}$ such that

\[ p_{2n+1}(x_i) = y_i, \qquad p'_{2n+1}(x_i) = z_i, \qquad i = 0, \ldots, n. \tag{6.13} \]

Proof Let us begin by supposing that $n \ge 1$. As in the case of Lagrange
interpolation, we start by constructing a set of auxiliary polynomials;
we consider the polynomials $H_k$ and $K_k$, $k = 0, 1, \ldots, n$, defined by

\[
\begin{aligned}
H_k(x) &= [L_k(x)]^2 \bigl(1 - 2 L_k'(x_k)(x - x_k)\bigr), \\
K_k(x) &= [L_k(x)]^2 (x - x_k),
\end{aligned}
\tag{6.14}
\]

where

\[ L_k(x) = \prod_{i=0,\, i \ne k}^{n} \frac{x - x_i}{x_k - x_i}. \]

Clearly $H_k$ and $K_k$, $k = 0, 1, \ldots, n$, are polynomials of degree $2n + 1$. It
is easy to see that $H_k(x_i) = K_k(x_i) = 0$, $H_k'(x_i) = K_k'(x_i) = 0$ whenever
$i, k \in \{0, 1, \ldots, n\}$ and $i \ne k$; moreover, a straightforward calculation
verifies their values when $i = k$, showing that

\[
H_k(x_i) = \begin{cases} 1, & i = k, \\ 0, & i \ne k, \end{cases}
\qquad H_k'(x_i) = 0, \qquad i, k = 0, 1, \ldots, n,
\]
\[
K_k(x_i) = 0, \qquad
K_k'(x_i) = \begin{cases} 1, & i = k, \\ 0, & i \ne k, \end{cases}
\qquad i, k = 0, 1, \ldots, n.
\]

We deduce that

\[ p_{2n+1}(x) = \sum_{k=0}^{n} \bigl[ H_k(x)\, y_k + K_k(x)\, z_k \bigr] \]

satisfies the conditions (6.13), and $p_{2n+1}$ is clearly an element of $\mathcal{P}_{2n+1}$.
To show that this is the only polynomial in $\mathcal{P}_{2n+1}$ satisfying these
conditions, we suppose otherwise; then, there exists a polynomial $q_{2n+1}$
in $\mathcal{P}_{2n+1}$, distinct from $p_{2n+1}$, such that

\[ q_{2n+1}(x_i) = y_i \quad \text{and} \quad q'_{2n+1}(x_i) = z_i, \qquad i = 0, 1, \ldots, n. \]

Consequently, $p_{2n+1} - q_{2n+1}$ has $n + 1$ distinct zeros; therefore, Rolle's
Theorem implies that, in addition to the $n + 1$ zeros $x_i$, $i = 0, 1, \ldots, n$,
$p'_{2n+1} - q'_{2n+1}$ vanishes at another $n$ points which interlace the $x_i$. Hence
$p'_{2n+1} - q'_{2n+1} \in \mathcal{P}_{2n}$ has $2n + 1$ zeros, which means that $p'_{2n+1} - q'_{2n+1}$ is
identically zero, so that $p_{2n+1} - q_{2n+1}$ is a constant function. However,
$(p_{2n+1} - q_{2n+1})(x_i) = 0$ for $i = 0, 1, \ldots, n$, and hence $p_{2n+1} - q_{2n+1} \equiv 0$,
contradicting the hypothesis that $p_{2n+1}$ and $q_{2n+1}$ are distinct. Thus,
$p_{2n+1}$ is unique.

When $n = 0$, we define $H_0(x) \equiv 1$ and $K_0(x) \equiv x - x_0$, which
correspond to taking $L_0(x) \equiv 1$ in (6.14). Clearly, $p_1$ defined by

\[ p_1(x) = H_0(x)\, y_0 + K_0(x)\, z_0 = y_0 + (x - x_0)\, z_0 \]

is the unique polynomial in $\mathcal{P}_1$ such that $p_1(x_0) = y_0$ and $p_1'(x_0) = z_0$.


Definition 6.3 Let $n \ge 0$, and suppose that $x_i$, $i = 0, \ldots, n$, are distinct
real numbers and $y_i$, $z_i$, $i = 0, \ldots, n$, are real numbers. The polynomial
$p_{2n+1}$ defined by

\[ p_{2n+1}(x) = \sum_{k=0}^{n} \bigl[ H_k(x)\, y_k + K_k(x)\, z_k \bigr], \tag{6.15} \]

where $H_k(x)$ and $K_k(x)$ are defined by (6.14), is called the Hermite
interpolation polynomial of degree $2n + 1$ for the set of values given
in $\{(x_i, y_i, z_i): i = 0, \ldots, n\}$.


Example 6.2 We shall construct a cubic polynomial $p_3$ such that

\[ p_3(0) = 0, \quad p_3(1) = 1, \quad p_3'(0) = 1 \quad \text{and} \quad p_3'(1) = 0. \]

Here $n = 1$, and since $p_3(0) = p_3'(1) = 0$ the polynomial simplifies to

\[ p_3(x) = H_1(x) + K_0(x). \]

We easily find that, with $n = 1$, $x_0 = 0$ and $x_1 = 1$,

\[ L_0(x) = 1 - x, \qquad L_1(x) = x, \]

\[
\begin{aligned}
H_1(x) &= [L_1(x)]^2 \bigl(1 - 2 L_1'(x_1)(x - x_1)\bigr) = x^2 (3 - 2x), \\
K_0(x) &= [L_0(x)]^2 (x - x_0) = (1 - x)^2 x.
\end{aligned}
\]

These yield the required Hermite interpolation polynomial,

\[ p_3(x) = -x^3 + x^2 + x. \]
□
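
The Hermite construction can also be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and it relies on the identity $L_k'(x_k) = \sum_{i \ne k} 1/(x_k - x_i)$, which follows from differentiating the product (6.4) and is not stated explicitly in the text. It evaluates $H_k$ and $K_k$ from their definitions in the proof of Theorem 6.3 and compares the result with the cubic of Example 6.2.

```python
def lagrange_basis(xs, k, x):
    """Evaluate L_k(x) from (6.4)."""
    value = 1.0
    for i, xi in enumerate(xs):
        if i != k:
            value *= (x - xi) / (xs[k] - xi)
    return value

def lagrange_basis_slope_at_node(xs, k):
    """L_k'(x_k), computed as the sum over i != k of 1 / (x_k - x_i)."""
    return sum(1.0 / (xs[k] - xi) for i, xi in enumerate(xs) if i != k)

def hermite_interpolate(xs, ys, zs, x):
    """Evaluate p_{2n+1}(x) = sum_k [H_k(x) y_k + K_k(x) z_k]."""
    total = 0.0
    for k in range(len(xs)):
        lk = lagrange_basis(xs, k, x)
        slope = lagrange_basis_slope_at_node(xs, k)
        hk = lk * lk * (1.0 - 2.0 * slope * (x - xs[k]))
        kk = lk * lk * (x - xs[k])
        total += hk * ys[k] + kk * zs[k]
    return total

# Data of Example 6.2: values 0, 1 and derivatives 1, 0 at the points 0, 1.
xs, ys, zs = [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]
for x in (0.0, 0.3, 0.5, 1.0):
    print(x, hermite_interpolate(xs, ys, zs, x), -x**3 + x**2 + x)
```

The two printed columns should coincide up to rounding, confirming that the general formula reproduces $p_3(x) = -x^3 + x^2 + x$ for this data.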


Definition 6.4 Suppose that $f$ is a real-valued function, defined on the
closed interval $[a, b]$ of $\mathbb{R}$, and that $f$ is continuous and differentiable on
this interval. Suppose, further, that $n \ge 0$ and that $x_i$, $i = 0, \ldots, n$, are
distinct points in $[a, b]$. Then, the polynomial $p_{2n+1}$ defined by

\[ p_{2n+1}(x) = \sum_{k=0}^{n} \bigl[ H_k(x)\, f(x_k) + K_k(x)\, f'(x_k) \bigr] \tag{6.16} \]

is the Hermite interpolation polynomial of degree $2n + 1$ with
interpolation points $x_i$, $i = 0, \ldots, n$, for $f$. It satisfies the conditions

\[ p_{2n+1}(x_i) = f(x_i), \qquad p'_{2n+1}(x_i) = f'(x_i), \qquad i = 0, \ldots, n. \]

Pictorially, the graph of $p_{2n+1}$ touches the graph of the function $f$ at
the points $x_i$, $i = 0, \ldots, n$.

To conclude this section we state a result, analogous to Theorem 6.2, 
concerning the error in Hermite interpolation. 


Theorem 6.4 Suppose that $n \ge 0$ and let $f$ be a real-valued function,
defined, continuous and $2n + 2$ times differentiable on the interval $[a, b]$,
such that $f^{(2n+2)}$ is continuous on $[a, b]$. Further, let $p_{2n+1}$ denote the
Hermite interpolation polynomial of $f$ defined by (6.16). Then, for each
$x \in [a, b]$ there exists $\xi = \xi(x)$ in $(a, b)$ such that

\[ f(x) - p_{2n+1}(x) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!}\, [\pi_{n+1}(x)]^2, \tag{6.17} \]

where $\pi_{n+1}$ is as defined in (6.9). Moreover,

\[ |f(x) - p_{2n+1}(x)| \le \frac{M_{2n+2}}{(2n+2)!}\, [\pi_{n+1}(x)]^2, \tag{6.18} \]

where $M_{2n+2} = \max_{\zeta \in [a, b]} |f^{(2n+2)}(\zeta)|$.


Proof The inequality (6.18) is a straightforward consequence of (6.17).
In order to prove (6.17), we observe that it is trivially true if $x = x_i$
for some $i$, $i = 0, \ldots, n$; thus, it suffices to consider $x \in [a, b]$ such that
$x \ne x_i$, $i = 0, \ldots, n$. For such $x$, let us define the function $t \mapsto \psi(t)$ by

\[ \psi(t) = f(t) - p_{2n+1}(t) - \frac{f(x) - p_{2n+1}(x)}{[\pi_{n+1}(x)]^2}\, [\pi_{n+1}(t)]^2. \]

Then, $\psi(x_i) = 0$ for $i = 0, \ldots, n$, and also $\psi(x) = 0$. Hence, by
Rolle's Theorem, $\psi'(t)$ vanishes at $n + 1$ points which lie strictly
between each pair of consecutive points from the set $\{x_0, \ldots, x_n, x\}$. Also
$\psi'(x_i) = 0$, $i = 0, \ldots, n$; hence $\psi'$ vanishes at a total of $2n + 2$ distinct
points in $[a, b]$. Applying Rolle's Theorem repeatedly, we find eventually
that $\psi^{(2n+2)}$ vanishes at some point $\xi$ in $(a, b)$, the location of $\xi$
being dependent on the position of $x$. This gives the required result on
computing $\psi^{(2n+2)}(t)$ from the definition of $\psi$ above and noting that
$\psi^{(2n+2)}(\xi) = 0$ and $p_{2n+1}^{(2n+2)}(t) \equiv 0$.


6.5 Differentiation


From the Lagrange interpolation polynomial $p_n$, defined by (6.7), which
is an approximation to $f$, it is easy to obtain the polynomial $p_n'$, which
is an approximation to the derivative $f'$. The polynomial $p_n'$ is given by

\[ p_n'(x) = \sum_{k=0}^{n} L_k'(x)\, f(x_k), \qquad n \ge 1. \tag{6.19} \]

The degree of the polynomial $p_n'$ is clearly at most $n - 1$; $p_n'$ is a linear
combination of the derivatives of the polynomials $L_k \in \mathcal{P}_n$, the
coefficients being the values of $f$ at the interpolation points $x_k$, $k = 0, 1, \ldots, n$.
In order to find an expression for the difference between $f'(x)$ and the
approximation $p_n'(x)$, we might simply differentiate (6.8) to give

\[ f'(x) - p_n'(x) = \frac{\mathrm{d}}{\mathrm{d}x} \left( \frac{f^{(n+1)}(\xi)}{(n+1)!}\, \pi_{n+1}(x) \right). \]

However, the result is not helpful: on application of the chain rule, the
right-hand side involves the derivative $\mathrm{d}\xi/\mathrm{d}x$; the value of $\xi$ depends on
$x$, but not in any simple manner. In fact, it is not a priori clear that the
function $x \mapsto \xi(x)$ is continuous, let alone differentiable. An alternative
approach is given by the following theorem.


Theorem 6.5 Let $n \ge 1$, and suppose that $f$ is a real-valued function
defined and continuous on the closed real interval $[a, b]$, such that the
derivative of order $n + 1$ of $f$ is continuous on $[a, b]$. Suppose further that
$x_i$, $i = 0, 1, \ldots, n$, are distinct points in $[a, b]$, and that $p_n \in \mathcal{P}_n$ is the
Lagrange interpolation polynomial for $f$ defined by these points. Then,
there exist distinct points $\eta_i$, $i = 1, \ldots, n$, in $(a, b)$, and corresponding
to each $x$ in $[a, b]$ there exists a point $\xi = \xi(x)$ in $(a, b)$, such that

\[ f'(x) - p_n'(x) = \frac{f^{(n+1)}(\xi)}{n!}\, \pi_n^*(x), \tag{6.20} \]

where

\[ \pi_n^*(x) = (x - \eta_1) \cdots (x - \eta_n). \]


Proof Since $f(x_i) - p_n(x_i) = 0$, $i = 0, 1, \ldots, n$, there exists a point $\eta_i$
in $(x_{i-1}, x_i)$ at which $f'(\eta_i) - p_n'(\eta_i) = 0$, for each $i = 1, \ldots, n$. This
defines the points $\eta_i$, $i = 1, \ldots, n$. Now the proof closely follows that of
Theorem 6.2.

When $x = \eta_i$ for some $i \in \{1, \ldots, n\}$, both sides of (6.20) are zero.
Suppose then that $x$ is distinct from all the $\eta_i$, $i = 1, \ldots, n$, and define
the function $t \mapsto \chi(t)$ by

\[ \chi(t) = f'(t) - p_n'(t) - \frac{f'(x) - p_n'(x)}{\pi_n^*(x)}\, \pi_n^*(t). \]

This function vanishes at every point $\eta_i$, $i = 1, \ldots, n$, and also at the
point $t = x$. By successively applying Rolle's Theorem we deduce that
$\chi^{(n)}$ vanishes at some point $\xi$. The result then follows as in the proof of
Theorem 6.2.


Corollary 6.1 Under the conditions of Theorem 6.5,

\[ |f'(x) - p_n'(x)| \le \frac{M_{n+1}}{n!}\, |\pi_n^*(x)| \le \frac{(b - a)^n}{n!}\, M_{n+1} \]

for all $x$ in $[a, b]$, where $M_{n+1} = \max_{x \in [a, b]} |f^{(n+1)}(x)|$.


In particular, we deduce that if $f$ and all its derivatives are defined
and continuous on the closed interval $[a, b]$, and

\[ \lim_{n \to \infty} \frac{(b - a)^n M_{n+1}}{n!} = 0, \]

then $\lim_{n \to \infty} \max_{x \in [a, b]} |f'(x) - p_n'(x)| = 0$, showing the convergence of
the sequence of interpolation polynomials $(p_n')$ to $f'$, uniformly on $[a, b]$.
The discussion in the last few paragraphs may give the impression that
numerical differentiation is a straightforward procedure. In practice,
however, things are much more complicated since the function values
$f(x_i)$, $i = 0, 1, \ldots, n$, will be polluted by rounding errors.

Example 6.3 Consider, for example, a real-valued function $f$ that is
defined, continuous and differentiable on the closed interval $[-h, h]$ of
the real line, where $h > 0$. Suppose that $f$ has been sampled at the
points $x_0 = -h$ and $x_1 = h$, and that $f(\pm h)$ are known, but only up
to rounding errors $\varepsilon_\pm$, respectively. Consider the Lagrange interpolation
polynomial $p_1 \in \mathcal{P}_1$ for $f$ that passes through the points $(-h, f(-h))$ and
$(h, f(h))$; clearly,

\[ p_1(x) = \frac{f(h) - f(-h)}{2h}\, (x + h) + f(-h). \]

Differentiating this with respect to $x$ yields

\[ p_1'(x) = \frac{f(h) - f(-h)}{2h}. \]

Now, $p_1'$ is a polynomial of degree 0, representing an approximation to
$f'(x)$ at any $x \in [-h, h]$, and in particular to $f'(0)$. Unfortunately, in the
presence of rounding errors only $f(-h) + \varepsilon_-$ and $f(h) + \varepsilon_+$ are available,
with $\varepsilon_\pm$ unknown; thus, we can only calculate

\[ \frac{(f(h) + \varepsilon_+) - (f(-h) + \varepsilon_-)}{2h}. \tag{6.21} \]

Rewriting this in the form

\[ \frac{f(h) - f(-h)}{2h} + \frac{\varepsilon_+ - \varepsilon_-}{2h}, \]

we see that even though the first fraction converges to $f'(0)$ as the
spacing $2h$ between the interpolation points $-h$ and $h$ tends to 0, for $\varepsilon_+ - \varepsilon_-$
nonzero and fixed the second fraction will tend to infinity as $h \to 0$.
Thus, if $h$ is too small in comparison with $|\varepsilon_+ - \varepsilon_-|$, our approximation
to $f'(0)$ will be polluted by a large error of size $|\varepsilon_+ - \varepsilon_-|/(2h)$, whereas
if $h$ is very large in comparison with $|\varepsilon_+ - \varepsilon_-|$, then $|\varepsilon_+ - \varepsilon_-|/(2h)$ will
be small, but $(f(h) - f(-h))/(2h)$ may be a poor approximation to the
value $f'(0)$. These observations indicate the existence of an 'optimal' $h$,
depending on the size of the rounding error, for which the error between
$f'(0)$ and the approximation (6.21) is smallest. (See Exercise 12 for
further details.) □
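
The trade-off described in Example 6.3 is easy to observe in floating-point arithmetic. The sketch below is an illustration under assumptions that are not part of the text: it takes $f(x) = e^x$ as a test function, so that $f'(0) = 1$, uses ordinary double-precision rounding in place of the errors $\varepsilon_\pm$, and the names `central_difference` and `error_for` are hypothetical.

```python
import math

def central_difference(f, h):
    """(f(h) - f(-h)) / (2h): the derivative of the linear interpolant p_1."""
    return (f(h) - f(-h)) / (2.0 * h)

def error_for(h):
    """Error in approximating f'(0) = 1 for the assumed test function exp."""
    return abs(central_difference(math.exp, h) - 1.0)

# Decrease h by factors of 10 and watch the error fall and then rise again.
for exponent in range(1, 16):
    h = 10.0 ** (-exponent)
    print(f"h = 1e-{exponent:02d}   error = {error_for(h):.3e}")
```

For moderate $h$ the error should shrink as $h$ decreases, since the first fraction in the example dominates; for very small $h$ the rounding contribution of size $|\varepsilon_+ - \varepsilon_-|/(2h)$ takes over and the error grows again, which is the 'optimal' $h$ effect noted above.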


Convergence, as $h \to 0$, of the expression $p_1'(x) = (f(h) - f(-h))/(2h)$
to $f'(0)$ in the last example should not be confused with convergence, as
$n \to \infty$, of the sequence of polynomials $(p_n')$ to the function $f'$ discussed
just prior to the example. In the former case, the polynomial degree is
fixed and the spacing between the two interpolation points, $x_0 = -h$
and $x_1 = h$, tends to 0; in the latter case, the degree of the polynomial
$p_n'$ tends to infinity and consequently the spacing between the increasing
number of consecutive interpolation points shrinks. Nevertheless,
Example 6.3 illustrates the issue that caution should be exercised in the
course of numerical differentiation when rounding errors are present.


6.6 Notes 


The interpolation polynomial (6.6) was discovered by Edward Waring 
(1736-1798) in 1776, rediscovered by Euler in 1783 and published by 
Joseph-Louis Lagrange (1736-1813) in his Leçons élémentaires sur les
mathématiques, Paris, 1795. 

Lagrange's interpolation theorem is a purely algebraic result, and it
also holds in number fields different from the field of real numbers
considered in this chapter. In particular, it holds if the numbers $x_i$ and
$y_i$, $i = 0, 1, \ldots, n$, are complex, and the polynomial $p_n$ has complex
coefficients. Theorem 6.2 is due to Augustin-Louis Cauchy (1789-1857).
The interpolation polynomial (6.15) was discovered by Charles Hermite
(1822-1901).

Before modern computers came into general use about 1960, the evalu- 
ation of a standard mathematical function for a given value of x required 
the use of published tables of the function, in book form. If x was not 
one of the tabulated values, the required result was obtained by inter- 
polation, using tabulated values close to x. The tabulated values were 
given at equally spaced points, so that usually x; = jh, where h is a 
fixed increment. In this case the Lagrange formula can be simplified; 
as this sort of interpolation had to be done frequently, various devices 
were used to make the calculations easy and quick. Older books, such 
as F.B. Hildebrand’s Introduction to Numerical Analysis, published in 
1956, contain extensive discussions of such special methods of interpo- 
lation, some of which date back to the time of Newton, but are now 
mainly of historical interest. A notable early contribution to the devel- 
opment of mathematical tables is the work of Henry Briggs (1560-1630), 
Savilian Professor of Geometry and fellow of Merton College in Oxford, 
entitled Arithmetica logarithmica, published in 1624. It contained ex- 
tensive calculations of the logarithms of thirty thousand numbers to 14 
decimal digits; these were the numbers from 1 to 20000 and from 90000 
to 100000. It also contained tables of the sin function to 15 decimal
digits, and of the tan and sec functions to 10 decimal digits.

Exercises


6.1 Construct the Lagrange interpolation polynomial $p_1$ of degree 1,
for a continuous function $f$ defined on the interval $[-1, 1]$, using
the interpolation points $x_0 = -1$, $x_1 = 1$. Show further that
if the second derivative of $f$ exists and is continuous on $[-1, 1]$,
then

\[ |f(x) - p_1(x)| \le \frac{M_2}{2}\, (1 - x^2) \le \frac{M_2}{2}, \qquad x \in [-1, 1], \]

where $M_2 = \max_{x \in [-1, 1]} |f''(x)|$. Give an example of a function
$f$, and a point $x$, for which equality is achieved.

6.2 (i) Write down the Lagrange interpolation polynomial of degree
1 for the function $f: x \mapsto x^3$, using the points $x_0 = 0$, $x_1 = a$.
Verify Theorem 6.2 by direct calculation, showing that in this
case $\xi$ is unique and has the value $\xi = \tfrac{1}{3}(x + a)$.

(ii) Repeat the calculation for the function $f: x \mapsto (2x - a)^4$;
show that in this case there are two possible values for $\xi$, and
give their values.

6.3 Given the distinct points $x_i$, $i = 0, 1, \ldots, n + 1$, and the points
$y_i$, $i = 0, 1, \ldots, n + 1$, let $q$ be the Lagrange polynomial of
degree $n$ for the set of points $\{(x_i, y_i): i = 0, 1, \ldots, n\}$ and
let $r$ be the Lagrange polynomial of degree $n$ for the points
$\{(x_i, y_i): i = 1, 2, \ldots, n + 1\}$. Define

\[ p(x) = \frac{(x - x_0)\, r(x) - (x - x_{n+1})\, q(x)}{x_{n+1} - x_0}. \]

Show that $p$ is the Lagrange polynomial of degree $n + 1$ for the
points $\{(x_i, y_i): i = 0, 1, \ldots, n + 1\}$.
Let $n \ge 1$. The points $x_j$ are equally spaced in $[-1, 1]$, so that

\[ x_j = \frac{2j - n}{n}, \qquad j = 0, \ldots, n. \]

With the usual notation

\[ \pi_{n+1}(x) = (x - x_0) \cdots (x - x_n), \]

show that

\[ \pi_{n+1}(1 - 1/n) = -\frac{(2n)!}{2^n\, n^{n+1}\, n!}. \]

Using Stirling's formula

\[ N! \sim \sqrt{2\pi}\, N^{N+1/2} e^{-N}, \qquad N \to \infty, \]

verify that

\[ \pi_{n+1}(1 - 1/n) \sim -\frac{2^{n+1/2} e^{-n}}{n} \]

for large values of $n$.
Let $n \ge 1$. Suppose that $x_i$, $i = 0, 1, \ldots, n$, are distinct real
numbers, and $y_i$, $u_i$, $i = 0, 1, \ldots, n$, are real numbers. Suppose,
further, that there exists $p_{2n+1} \in \mathcal{P}_{2n+1}$ such that $p_{2n+1}(x_i) = y_i$
for all $i = 0, 1, \ldots, n$, and $p''_{2n+1}(x_i) = u_i$, $i = 0, 1, \ldots, n$.
Attempt to prove that $p_{2n+1}$ is the unique polynomial with these
properties, by adapting the uniqueness proofs in Sections 6.2
and 6.4, using Rolle's Theorem; explain where the proof fails.
Show that there is no polynomial $p_5 \in \mathcal{P}_5$ such that $p_5(-1) = 1$,
$p_5(0) = 0$, $p_5(1) = 1$, $p''_5(-1) = 0$, $p''_5(0) = 0$, $p''_5(1) = 0$, but that
if the first condition is replaced by $p_5(-1) = -1$, then there is an
infinite number of such polynomials. Give an explicit expression
for the general form of these polynomials.

Suppose that $n \ge 1$. The function $f$ and its derivatives of
order up to and including $2n + 1$ are continuous on $[a, b]$. The
points $x_i$, $i = 0, 1, \ldots, n$, are distinct and lie in $[a, b]$. Construct
polynomials $l_0(x)$, $h_i(x)$, $k_i(x)$, $i = 1, \ldots, n$, of degree $2n$ such
that the polynomial

\[ p_{2n}(x) = l_0(x) f(x_0) + \sum_{i=1}^{n} [h_i(x) f(x_i) + k_i(x) f'(x_i)] \]

satisfies the conditions

\[ p_{2n}(x_i) = f(x_i), \qquad i = 0, 1, \ldots, n, \]

and

\[ p'_{2n}(x_i) = f'(x_i), \qquad i = 1, \ldots, n. \]

Show also that for each value of $x$ in $[a, b]$ there is a number $\eta$,
depending on $x$, such that

\[ f(x) - p_{2n}(x) = \frac{(x - x_0) \prod_{i=1}^{n} (x - x_i)^2}{(2n+1)!}\, f^{(2n+1)}(\eta). \]

Suppose that $n \ge 2$. The function $f$ and its derivatives of order
up to and including $2n$ are continuous on $[a, b]$. The points
$x_i$, $i = 0, 1, \ldots, n$, are distinct and lie in $[a, b]$. Explain how to
construct polynomials $l_0(x)$, $l_n(x)$, $h_i(x)$, $k_i(x)$, $i = 1, \ldots, n-1$,
of degree $2n - 1$ such that the polynomial

\[ p_{2n-1}(x) = l_0(x) f(x_0) + l_n(x) f(x_n) + \sum_{i=1}^{n-1} [h_i(x) f(x_i) + k_i(x) f'(x_i)] \]

satisfies the conditions $p_{2n-1}(x_i) = f(x_i)$, $i = 0, 1, \ldots, n$, and
$p'_{2n-1}(x_i) = f'(x_i)$, $i = 1, \ldots, n-1$. It is not necessary to give
explicit expressions for these polynomials.

Show also that for each value of $x$ in $[a, b]$ there is a number
$\eta$, depending on $x$, such that

\[ f(x) - p_{2n-1}(x) = \frac{(x - x_0)(x - x_n) \prod_{i=1}^{n-1} (x - x_i)^2}{(2n)!}\, f^{(2n)}(\eta). \]

By considering the symmetry of the graph of the polynomial

\[ q(x) = x(x^2 - 1)(x^2 - 4)(x - 3), \]

show that the maximum of $|q(x)|$ over the interval $[0, 1]$ is attained
at the point $x = \frac{1}{2}$.

The values of the function $f\colon x \mapsto \sin x$ are given at the points
$x_i = i\pi/8$, for all integer values of $i$. For a general value of $x$,
an approximation $u(x)$ to $f(x)$ is calculated by first defining $k$
to be the integer part of $8x/\pi$, so that $x_k \le x < x_{k+1}$, and
then evaluating the Lagrange polynomial of degree 5 using the
six interpolation points $(x_j, f(x_j))$, $j = k-2, \ldots, k+3$. Show
that, for all values of $x$,

\[ |\sin x - u(x)| \le \frac{225\,\pi^6}{16^6 \times 6!} < 0.00002. \]
Let $n \ge 1$. The interpolation points $x_j$, $j = 0, 1, \ldots, 2n-1$,
are distinct, and $x_{n+j} = x_j + \varepsilon$ for each $j = 0, \ldots, n-1$. The
Lagrange polynomial of degree $2n - 1$ for the function $f$ using
these points is denoted by $p_{2n-1}$. Show that the terms involving
$f(x_j)$ and $f(x_{n+j})$ in $p_{2n-1}$ may be written

\[ \frac{\varphi_j(x)\,\varphi_j(x - \varepsilon)}{\varepsilon\,\varphi_j(x_j)} \left\{ \frac{(x - x_j)\, f(x_j + \varepsilon)}{\varphi_j(x_j + \varepsilon)} - \frac{(x - x_j - \varepsilon)\, f(x_j)}{\varphi_j(x_j - \varepsilon)} \right\}, \]

where

\[ \varphi_j(x) = \prod_{\substack{i=0 \\ i \ne j}}^{n-1} (x - x_i). \]

Find the limit of this expression as $\varepsilon \to 0$, and deduce that
$p_{2n-1} - q_{2n-1} \to 0$ as $\varepsilon \to 0$, where $q_{2n-1}$ is the Hermite
interpolation polynomial for $f$, using the points $x_i$, $i = 0, \ldots, n-1$.
Construct the Hermite interpolation polynomial of degree 3 for
the function $f\colon x \mapsto x^5$, using the points $x_0 = 0$, $x_1 = a$,
and show that it has the form $p_3(x) = 3a^2 x^3 - 2a^3 x^2$. Verify
Theorem 6.4 by direct calculation, showing that in this case $\xi$
is unique and has the value $\xi = \frac{1}{5}(x + 2a)$.

The complex function $z \mapsto f(z)$ of the complex variable $z$ is
holomorphic in the region $D$ of the complex plane; the boundary
of $D$ is the simple closed contour $C$. The interpolation points
$x_j$, $j = 0, 1, \ldots, n$, with $n \ge 1$, and the point $x$ all lie in $D$.
Determine the residues of the function $g$ defined by

\[ g(z) = \frac{f(z)}{z - x} \prod_{j=0}^{n} \frac{x - x_j}{z - x_j} \]

at its poles in $D$, and deduce that

\[ f(x) - p_n(x) = \frac{1}{2\pi i} \oint_C \frac{f(z)}{z - x} \prod_{j=0}^{n} \frac{x - x_j}{z - x_j}\, dz, \]

where $p_n$ is the Lagrange interpolation polynomial for the function
$f$ using the interpolation points $x_j$, $j = 0, 1, \ldots, n$.

Now, suppose that the real number $x$ and the interpolation
points $x_j$, $j = 0, 1, \ldots, n$, all lie in the real interval $[a, b]$, and
that $D$ consists of all the points $z$ such that $|z - t| < K$ for some
$t \in [a, b]$, where $K$ is a constant with $K > |b - a|$. Show that
the length of the contour $C$ is $2(b - a) + 2\pi K$, and that

\[ |f(x) - p_n(x)| \le \frac{(b - a + \pi K)\, M}{\pi K} \left( \frac{|b - a|}{K} \right)^{n+1}, \]

where $M$ is such that $|f(z)| \le M$ on $C$. Deduce that the sequence
$(p_n)$ converges to $f$, uniformly on $[a, b]$.

Show that these conditions are not satisfied by the function
$f\colon x \mapsto 1/(1 + x^2)$ for $x$ in the interval $[-5, 5]$. For what values
of $a$ are the conditions satisfied by $f$ for $x$ in the interval $[-a, a]$?
With the same notation as in Example 6.3, let 


Suppose that $f'''(x)$ exists and is continuous at all $x \in [-h, h]$.
By expanding $f(h)$ and $f(-h)$ into Taylor series about the point
0, show that there exists $\xi \in (-h, h)$ such that

\[ E(h) = \frac{1}{6} h^2 f'''(\xi) + \frac{\varepsilon_+ - \varepsilon_-}{2h}. \]

Hence deduce that

\[ |E(h)| \le \frac{1}{6} h^2 M_3 + \frac{\varepsilon}{h}, \]

where $M_3 = \max_{x \in [-h,h]} |f'''(x)|$ and $\varepsilon = \max(|\varepsilon_+|, |\varepsilon_-|)$. Show
further that the right-hand side of the last inequality achieves
its minimum value when

\[ h^3 = 3\varepsilon / M_3. \]


7 


Numerical integration – I


7.1 Introduction 


The problem of evaluating definite integrals arises both in mathematics
and beyond, in many areas of science and engineering. At some point in
our mathematical education we all learned to calculate simple integrals
such as

\[ \int_0^1 e^x\,dx \quad\text{or}\quad \int_0^\pi \cos x\,dx \]

using a table of integrals, so you will know that the values of these are
$e - 1$ and $0$ respectively; but how about the innocent-looking

\[ \int_0^1 e^{x^2}\,dx \quad\text{and}\quad \int_0^\pi \cos(x^2)\,dx, \]

or the more exotic

\[ \int_1^{2000} \exp(\sin(\cos(\sinh(\cosh(\tan^{-1}(\log(x)))))))\,dx\,? \]


Please try to evaluate these using a table of integrals and see how far 
you can get! It is not so simple, is it? Of course, you could argue that 
the last example was completely artificial. Still, it illustrates the point 
that it is relatively easy to think of a continuous real-valued function f 
defined on a closed interval [a,b] of the real line such that the definite 
integral 


\[ \int_a^b f(x)\,dx \tag{7.1} \]

is very hard to reduce to an entry in the table of integrals by means of 
the usual tricks of variable substitution and integration by parts. If you 
have access to the computer package Maple, you may try to type 


evalf(int(exp(sin(cos(sinh(cosh(arctan(log(x))))))), x=1..2000));


at the Maple command line. In about the same time as it will take you 
to correctly type the command at the keyboard, as if by magic, the result 
1514.780678 will pop up on the screen. How was this number arrived 
at? 
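
As a rough illustration of how such a number can be produced, the following Python sketch applies the composite Simpson rule, which is developed later in this chapter (Definition 7.2), to the same integrand. This is an assumption made for illustration only: the text does not say which algorithm Maple uses, and the names `integrand` and `composite_simpson`, as well as the numbers of subintervals, are our own choices.

```python
import math

def integrand(x):
    # The 'exotic' integrand from the text, written with Python's math module.
    return math.exp(math.sin(math.cos(math.sinh(math.cosh(math.atan(math.log(x)))))))

def composite_simpson(f, a, b, m):
    """Composite Simpson rule using 2m subintervals of width h = (b - a)/(2m)."""
    h = (b - a) / (2 * m)
    total = f(a) + f(b)
    for i in range(1, 2 * m):
        # Interior points carry the coefficients 4, 2, 4, ..., 2, 4.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return h * total / 3

# Two resolutions; their agreement gives a crude indication of the accuracy.
coarse = composite_simpson(integrand, 1.0, 2000.0, 50_000)
fine = composite_simpson(integrand, 1.0, 2000.0, 100_000)
print(f"{coarse:.6f}  {fine:.6f}")
```

Comparing the two printed values gives an informal check on the accuracy; the error estimates derived in Sections 7.3 and 7.5 make this kind of check precise.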

The purpose of this chapter, and its continuation, Chapter 10, is to 
answer this question. Specifically, we shall address the problem of eval- 
uating (7.1) approximately, by applying the results of Chapter 6 on 
polynomial interpolation to derive formulae for numerical integration 
(also called numerical quadrature rules). We shall also explain how one 
can estimate the associated approximation error. What does polynomial 
interpolation have to do with evaluating definite integrals? The answer 
will be revealed in the next section which is about a class of quadrature 
formulae bearing the names of two English mathematicians: Newton 
and Cotes.¹


7.2 Newton—Cotes formulae 


Let $f$ be a real-valued function, defined and continuous on the closed 
real interval $[a, b]$, and suppose that we have to evaluate the integral 

\[ \int_a^b f(x)\,dx. \]

Since polynomials are easy to integrate, the idea, roughly speaking, is 
to approximate the function $f$ by its Lagrange interpolation polynomial 
$p_n$ of degree $n$, and integrate $p_n$ instead. Thus, 

\[ \int_a^b f(x)\,dx \approx \int_a^b p_n(x)\,dx. \tag{7.2} \]

For a positive integer $n$, let $x_i$, $i = 0, 1, \ldots, n$, denote the interpolation 


1 Roger Cotes (10 July 1682, Burbage, Leicestershire, England — 5 June 1716, Cam- 
bridge, Cambridgeshire, England) was a fellow of Trinity College in Cambridge. 
At the age of 26 he became the first Plumian Professor of Astronomy and Ex- 
perimental Philosophy. Even though he only published one paper in his lifetime, 
entitled ‘Logometria’, Cotes made important contributions to the theory of loga- 
rithms and integral calculus, particularly interpolation and table construction. In 
reference to Cotes’ early death, Newton said: If he had lived we might have known 
something. 
points; for the sake of simplicity, we shall assume that these are equally 
spaced, that is, 

\[ x_i = a + ih, \qquad i = 0, 1, \ldots, n, \]

where 

\[ h = (b - a)/n. \]

The Lagrange interpolation polynomial of degree $n$ for the function $f$, 
with these interpolation points, is of the form 

\[ p_n(x) = \sum_{k=0}^{n} L_k(x) f(x_k), \qquad\text{where}\qquad L_k(x) = \prod_{\substack{i=0 \\ i \ne k}}^{n} \frac{x - x_i}{x_k - x_i}. \]


Inserting the expression for $p_n$ into the right-hand side of (7.2) yields 

\[ \int_a^b f(x)\,dx \approx \sum_{k=0}^{n} w_k f(x_k), \tag{7.3} \]

where 

\[ w_k = \int_a^b L_k(x)\,dx, \qquad k = 0, 1, \ldots, n. \tag{7.4} \]

The values $w_k$, $k = 0, 1, \ldots, n$, are referred to as the quadrature 
weights, while the interpolation points $x_k$, $k = 0, 1, \ldots, n$, are called 

the quadrature points. The numerical quadrature rule (7.3), with 
quadrature weights (7.4) and equally spaced quadrature points, is called 
the Newton—Cotes formula of order n. In order to illustrate the 
general idea, we consider two simple examples. 

Trapezium rule. In this case we take $n = 1$, so that $x_0 = a$, $x_1 = b$; 
the Lagrange interpolation polynomial of degree 1 for the function $f$ is 
simply 

\[ p_1(x) = L_0(x) f(a) + L_1(x) f(b) = \frac{x - b}{a - b}\, f(a) + \frac{x - a}{b - a}\, f(b) = \frac{1}{b - a} \left[ (b - x) f(a) + (x - a) f(b) \right]. \]

Integrating $p_1(x)$ from $a$ to $b$ yields 

\[ \int_a^b f(x)\,dx \approx \frac{b - a}{2} \left[ f(a) + f(b) \right]. \]


This numerical integration formula is called the trapezium rule. The 
terminology stems from the fact that the expression on the right is the 
area of the trapezium with vertices $(a, 0)$, $(b, 0)$, $(a, f(a))$, $(b, f(b))$. 
Simpson's rule.¹ A slightly more sophisticated quadrature rule is 
obtained by taking $n = 2$. In this case $x_0 = a$, $x_1 = (a + b)/2$ and 
$x_2 = b$, and the function $f$ is approximated by a quadratic Lagrange 
interpolation polynomial. 
The quadrature weights are calculated from 

\[ w_0 = \int_a^b L_0(x)\,dx = \int_a^b \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\,dx = \frac{b - a}{2} \int_{-1}^{1} \frac{t(t - 1)}{2}\,dt = \frac{b - a}{6}, \]

where it is convenient to make the change of variable 

\[ x = \frac{a + b}{2} + \frac{b - a}{2}\, t. \]

Similarly, $w_1 = \frac{2}{3}(b - a)$, and it is easy to see that $w_2 = w_0$ by symmetry. 
This gives 

\[ \int_a^b f(x)\,dx \approx \frac{b - a}{6} \left[ f(a) + 4 f\!\left( \frac{a + b}{2} \right) + f(b) \right], \]

a numerical integration formula known as Simpson's rule. 
It is very important to notice that the weights $w_k$ defined in (7.4) 
depend only on $n$ and $k$, not on the function $f$. Their values can therefore 



1 Thomas Simpson (20 August 1710, Market Bosworth, Leicestershire, England — 
14 May 1761, Market Bosworth, Leicestershire, England) was a weaver by training 
who taught mathematics in the London coffee-houses. His two-volume work enti- 
tled The Doctrine and Application of Fluxions published in 1750 contains some of 
the work that Cotes hoped to publish with Cambridge University Press but was 
prevented by his premature death. In 1796 fellow mathematician Charles Hutton 
gave the following description of Simpson: It has been said that Mr Simpson fre- 
quented low company, with whom he used to guzzle porter and gin: but it must 
be observed that the misconduct of his family put it out of his power to keep the 
company of gentlemen, as well as to procure better liquor. On a related subject: 
in his New Stereometry of Wine Barrels (Nova stereometria doliorum vinariorum 
(1615)), the astronomer Johannes Kepler (1571-1630) approximated the volumes 
of many three-dimensional solids, each of which was formed by revolving a two- 
dimensional region around an axis line. For each of these volumes of revolution, 
he subdivided the solid into many thin slices the sum of whose volumes then 
approximated the desired total volume. 
be calculated in advance, as in the trapezium rule and Simpson's rule. 
The evaluation of the approximation to the integral (7.1) is then a trivial 
matter; it is only necessary to compute $f(x_k)$ at each of the quadrature 
points $x_k$, $k = 0, 1, \ldots, n$, multiply by the known weights $w_k$ for $k = 
0, 1, \ldots, n$, and form the sum on the right-hand side of (7.3). 
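
The recipe above (compute the weights once, then form a weighted sum of function values) can be sketched in Python as follows. This is a minimal illustration using exact rational arithmetic; the function names `newton_cotes_weights` and `newton_cotes` are hypothetical, and the weights are computed on the reference interval $[0, 1]$ and then scaled by $b - a$.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Exact weights w_k of (7.4) on [0, 1]; multiply by (b - a) for [a, b]."""
    nodes = [Fraction(i, n) for i in range(n + 1)]
    weights = []
    for k in range(n + 1):
        coeffs = [Fraction(1)]          # L_k as coefficients of 1, x, x^2, ...
        for i in range(n + 1):
            if i == k:
                continue
            scale = 1 / (nodes[k] - nodes[i])
            shifted = [Fraction(0)] + coeffs                  # x * (current polynomial)
            lowered = [nodes[i] * c for c in coeffs] + [Fraction(0)]
            # Multiply the current polynomial by (x - x_i) / (x_k - x_i).
            coeffs = [scale * (s - l) for s, l in zip(shifted, lowered)]
        # Integrate term by term over [0, 1].
        weights.append(sum(c / (p + 1) for p, c in enumerate(coeffs)))
    return weights

def newton_cotes(f, a, b, n):
    """Apply (7.3): the weighted sum of f at n + 1 equally spaced points."""
    h = (b - a) / n
    return (b - a) * sum(float(w) * f(a + k * h)
                         for k, w in enumerate(newton_cotes_weights(n)))

print(newton_cotes_weights(1))   # trapezium rule
print(newton_cotes_weights(2))   # Simpson's rule
```

For $n = 1$ and $n = 2$ the computed weights are $\frac12, \frac12$ and $\frac16, \frac23, \frac16$, in agreement with the trapezium rule and Simpson's rule derived above.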


7.3 Error estimates 
Our next task is to estimate the size of the error in the numerical 
integration formula (7.3), that is, the error that has been committed by 
integrating the interpolating Lagrange polynomial of $f$ instead of $f$ 
itself. The error in (7.3) is defined by 

\[ E_n(f) = \int_a^b f(x)\,dx - \sum_{k=0}^{n} w_k f(x_k). \]

The next theorem provides a useful bound on $E_n(f)$ under the additional 
hypothesis that the function $f$ is sufficiently smooth. 


Theorem 7.1 Let $n \ge 1$. Suppose that $f$ is a real-valued function, 
defined and continuous on the interval $[a, b]$, and let $f^{(n+1)}$ be defined 
and continuous on $[a, b]$. Then, 

\[ |E_n(f)| \le \frac{M_{n+1}}{(n+1)!} \int_a^b |\pi_{n+1}(x)|\,dx, \tag{7.5} \]

where $M_{n+1} = \max_{\zeta \in [a,b]} |f^{(n+1)}(\zeta)|$ and $\pi_{n+1}(x) = (x - x_0) \cdots (x - x_n)$. 


Proof Recalling the definition of the weights $w_k$ from (7.4), we can write 
$E_n(f)$ as follows: 

\[ E_n(f) = \int_a^b f(x)\,dx - \int_a^b \left( \sum_{k=0}^{n} L_k(x) f(x_k) \right) dx = \int_a^b [f(x) - p_n(x)]\,dx. \]

Thus, 

\[ |E_n(f)| \le \int_a^b |f(x) - p_n(x)|\,dx. \]

The desired error estimate (7.5) follows by inserting (6.8) into the right-hand 
side of this inequality. 

Let us use this theorem to estimate the size of the error which arises 
from applying the trapezium rule to the integral $\int_a^b f(x)\,dx$. In this case, 
with $n = 1$ and $\pi_2(x) = (x - a)(x - b)$, the bound (7.5) reduces to 

\[ |E_1(f)| \le \frac{M_2}{2} \int_a^b |(x - a)(x - b)|\,dx = \frac{M_2}{2} \int_a^b (b - x)(x - a)\,dx = \frac{(b - a)^3}{12}\, M_2. \tag{7.6} \]


An analogous but slightly more tedious calculation shows that, for 
Simpson's rule, 

\[ |E_2(f)| \le \frac{M_3}{6} \int_a^b |(x - a)(x - (a + b)/2)(x - b)|\,dx = \frac{(b - a)^4}{192}\, M_3. \tag{7.7} \]


Unfortunately, (7.7) gives a considerable overestimate of the error in 
Simpson's rule; in particular it does not bring out the fact that $E_2(f) = 0$ 
whenever $f$ is a polynomial of degree 3. The next theorem will allow us 
to give a sharper bound on the error in Simpson's rule which illustrates 
this fact. More generally, it is quite easy to prove that when $n$ is odd 
the Newton–Cotes formula (7.3) (with $w_k$ defined by (7.4)) is exact for 
all polynomials of degree $n$, while when $n$ is even it is also exact for all 
polynomials of degree $n + 1$ (see Exercise 2 at the end of the chapter). 
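
The bounds (7.6) and (7.7), and the exactness of Simpson's rule for cubics, can be checked numerically. The sketch below uses $f(x) = e^x$ on $[0, 1]$ as an assumed test case (every derivative of $e^x$ is bounded by $e$ on this interval); the function names are ours.

```python
import math

def trapezium(f, a, b):
    return (b - a) / 2 * (f(a) + f(b))

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

a, b = 0.0, 1.0
exact = math.e - 1.0                   # integral of e^x over [0, 1]
M = math.e                             # bound on every derivative of e^x on [0, 1]

err_trap = abs(exact - trapezium(math.exp, a, b))
err_simp = abs(exact - simpson(math.exp, a, b))
bound_trap = (b - a) ** 3 / 12 * M     # right-hand side of (7.6)
bound_simp = (b - a) ** 4 / 192 * M    # right-hand side of (7.7)

# Simpson's rule applied to a cubic; the exact integral of x^3 over [0, 1] is 1/4.
cubic_error = abs(0.25 - simpson(lambda x: x ** 3, a, b))

print(err_trap, bound_trap)
print(err_simp, bound_simp)
print(cubic_error)
```

In this example the Simpson error is far below the bound (7.7), which is consistent with the remark that (7.7) is a considerable overestimate.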


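Theorem 7.2 Suppose that $f$ is a real-valued function, defined and 
continuous on the interval $[a, b]$, and that $f^{\mathrm{iv}} = f^{(4)}$, the fourth derivative 
of $f$, is continuous on $[a, b]$. Then, 

\[ \int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f((a + b)/2) + f(b) \right] = -\frac{(b - a)^5}{2880}\, f^{\mathrm{iv}}(\xi) \tag{7.8} \]

for some $\xi$ in $(a, b)$. 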


Proof Making the change of variable 

\[ x = \frac{a + b}{2} + \frac{b - a}{2}\, t, \qquad t \in [-1, 1], \]

and defining the function $t \mapsto F(t)$ by $F(t) = f(x)$, we see that 

\[ \int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f((a + b)/2) + f(b) \right] = \frac{b - a}{2} \left( \int_{-1}^{1} F(\tau)\,d\tau - \frac{1}{3} \left[ F(-1) + 4F(0) + F(1) \right] \right). \tag{7.9} \]

We now introduce the function $t \mapsto G(t)$ by 

\[ G(t) = \int_{-t}^{t} F(\tau)\,d\tau - \frac{t}{3} \left[ F(-t) + 4F(0) + F(t) \right], \qquad t \in [-1, 1]; \]

the right-hand side of (7.9) is then simply $\frac{1}{2}(b - a)G(1)$. 
The remainder of the proof is devoted to showing that $\frac{1}{2}(b - a)G(1)$ 
is, in turn, equal to the right-hand side of (7.8) for some $\xi$ in $(a, b)$. To 
do so, we define 

\[ H(t) = G(t) - t^5 G(1), \qquad t \in [-1, 1], \]

and apply Rolle's Theorem repeatedly to the function $H$. Noting that 
$H(0) = H(1) = 0$, we deduce that there exists $\zeta_1 \in (0, 1)$ such that 
$H'(\zeta_1) = 0$. But it is easy to show that $H'(0) = 0$, so there exists 
$\zeta_2 \in (0, \zeta_1)$ such that $H''(\zeta_2) = 0$. Again we see that $H''(0) = 0$, so 
there exists $\zeta_3 \in (0, \zeta_2)$ such that $H'''(\zeta_3) = 0$. Now, 

\[ G'''(t) = -\frac{t}{3} \left[ F'''(t) - F'''(-t) \right], \]

and therefore 

\[ H'''(\zeta_3) = -\frac{\zeta_3}{3} \left[ F'''(\zeta_3) - F'''(-\zeta_3) \right] - 60\,\zeta_3^2\, G(1). \]

Applying the Mean Value Theorem to the function $F'''$, this shows that 
there exists $\xi_4 \in (-\zeta_3, \zeta_3)$ such that 

\[ H'''(\zeta_3) = -\frac{\zeta_3}{3} \left[ 2\zeta_3 F^{\mathrm{iv}}(\xi_4) \right] - 60\,\zeta_3^2\, G(1) = -\frac{2}{3}\,\zeta_3^2 \left[ F^{\mathrm{iv}}(\xi_4) + 90\, G(1) \right]. \]

Since $H'''(\zeta_3) = 0$ and $\zeta_3 \ne 0$, this means that 

\[ G(1) = -\frac{1}{90}\, F^{\mathrm{iv}}(\xi_4) = -\frac{(b - a)^4}{1440}\, f^{\mathrm{iv}}(\xi), \]

and the required result follows. 

Theorem 7.2 yields the following bound on the error in Simpson's rule: 

\[ |E_2(f)| \le \frac{(b - a)^5}{2880}\, M_4. \tag{7.10} \]

This is a considerable improvement on the earlier bound (7.7); when $f$ 
is a polynomial of degree 3, the bound correctly shows that $E_2(f) = 0$. 

There is a great variety of quadrature rules constructed in the same 
way as the Newton–Cotes formulae. For example, it may sometimes be 
useful to involve quadrature points outside the interval of integration, 
as in 

\[ \int_0^1 f(x)\,dx \approx c_{-1} f(-1) + c_0 f(0) + c_1 f(1). \tag{7.11} \]


The coefficients are determined similarly as in (7.4), but now $x_{-1} = -1$, 
$x_0 = 0$, $x_1 = 1$ and 

\[ L_{-1}(x) = \tfrac{1}{2} x(x - 1), \qquad L_0(x) = 1 - x^2, \qquad L_1(x) = \tfrac{1}{2} x(x + 1). \]

Hence, 

\[ c_{-1} = \int_0^1 \tfrac{1}{2} x(x - 1)\,dx = -\tfrac{1}{12}. \]

In a similar way we find that $c_0 = \frac{2}{3}$, $c_1 = \frac{5}{12}$. 

The quadrature rule (7.11) is then exact when $f$ is any polynomial 
of degree 2 or less. More generally, for any three times continuously 
differentiable function $f$, Theorem 7.1 extends in an obvious way to give 

\[ \left| \int_0^1 f(x)\,dx + \tfrac{1}{12} f(-1) - \tfrac{2}{3} f(0) - \tfrac{5}{12} f(1) \right| \le \frac{M_3}{6} \int_0^1 |x(x + 1)(x - 1)|\,dx = \frac{M_3}{24}, \]

but there is an important difference. To justify this estimate we now need 
a condition on $f$ outside the interval of integration: we must require that 
$f$ and $f'''$ are continuous on $[-1, 1]$, and $M_3$ is the maximum of $|f'''(x)|$ 
on $[-1, 1]$. More generally, the conditions must hold on an interval which 
contains the interval of integration, and also all the quadrature points. 


Table 7.1. $I_n$ is the result of the Newton–Cotes formula of degree $n$ for 
the approximation of the integral (7.12) 

     n       I_n
     1     0.38462
     2     6.79487
     3     2.08145
     4     2.37401
     5     2.30769
     6     3.87045
     7     2.89899
     8     1.50049
     9     2.39862
    10     4.67330
    11     3.24477
    12    -0.31294
    13     1.91980
    14     7.89954
    15     4.15556


7.4 The Runge phenomenon revisited 


By looking at the right-hand side of the error bound (7.5) we may be led 
to believe that by increasing n, that is by approximating the integrand by 
Lagrange interpolation polynomials of increasing degree and integrating 
these exactly, we shall reduce the size of the quadrature error E,,(f). 
However, this is not always the case, even for very smooth functions 
f. An example of this behaviour uses the same function as in Section 
6.3; Table 7.1 gives the results of applying Newton—Cotes formulae of 
increasing degree to the evaluation of the integral 


\[ \int_{-5}^{5} \frac{1}{1 + x^2}\,dx. \tag{7.12} \]

These results evidently do not converge as $n$ increases, and in fact they 
eventually increase without bound. This behaviour is related to the fact 
that the weights $w_k$ in the Newton–Cotes formula are not all positive 
when $n = 8$ or $n \ge 10$. We shall return to this point in Theorem 10.2. 
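
The entries of Table 7.1 can be recomputed with exact rational arithmetic, which avoids any rounding in the weights. The Python sketch below is an illustration under our own naming (`unit_weights`, `runge_newton_cotes`); it builds the Lagrange basis polynomials on the integer points $0, 1, \ldots, n$, integrates them exactly, and applies the resulting rule to (7.12).

```python
from fractions import Fraction

def unit_weights(n):
    """Exact Newton-Cotes weights of order n for an interval of length 1."""
    weights = []
    for k in range(n + 1):
        coeffs = [Fraction(1)]           # basis polynomial in t, nodes t = 0, 1, ..., n
        for i in range(n + 1):
            if i != k:
                # Multiply by (t - i) / (k - i).
                coeffs = [(prev - i * cur) / (k - i)
                          for prev, cur in zip([Fraction(0)] + coeffs,
                                               coeffs + [Fraction(0)])]
        # Integrate over [0, n], then rescale to an interval of length 1.
        integral = sum(c * Fraction(n) ** (p + 1) / (p + 1)
                       for p, c in enumerate(coeffs))
        weights.append(integral / n)
    return weights

def runge_newton_cotes(n):
    """Newton-Cotes approximation of order n to the integral (7.12)."""
    a, b = Fraction(-5), Fraction(5)
    total = Fraction(0)
    for k, w in enumerate(unit_weights(n)):
        x = a + (b - a) * Fraction(k, n)
        total += w / (1 + x * x)
    return (b - a) * total

for n in range(1, 16):
    print(f"{n:2d}  {float(runge_newton_cotes(n)):9.5f}")
```

For comparison, the exact value of (7.12) is $2\tan^{-1} 5 \approx 2.7468$; the printed values do not settle down to it as $n$ grows.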

A better approach to improving accuracy is to divide the interval 
$[a, b]$ into an increasing number of subintervals of decreasing size, and 
then to use a numerical integration formula of fixed order $n$ on each 
of the subintervals. Quadrature rules based on this approach are called 
composite formulae; in the next section we shall describe two examples.¹ 


7.5 Composite formulae 


We shall consider only some very simple composite quadrature rules: 
the composite trapezium rule and the composite Simpson rule. 
Suppose that $f$ is a function, defined and continuous on a nonempty 
closed interval $[a, b]$ of the real line. In order to construct an approximation 
to 

\[ \int_a^b f(x)\,dx, \]

we now select an integer $m \ge 2$ and divide the interval $[a, b]$ into $m$ equal 
subintervals, each of width $h = (b - a)/m$, so that 

\[ \int_a^b f(x)\,dx = \sum_{i=1}^{m} \int_{x_{i-1}}^{x_i} f(x)\,dx, \tag{7.13} \]

where 

\[ x_i = a + ih = a + \frac{i}{m}(b - a), \qquad i = 0, 1, \ldots, m. \]

Each of the integrals is then evaluated by the trapezium rule, 

\[ \int_{x_{i-1}}^{x_i} f(x)\,dx \approx \frac{1}{2} h \left[ f(x_{i-1}) + f(x_i) \right]; \tag{7.14} \]

summing these over $i = 1, 2, \ldots, m$ leads to the following definition. 


Definition 7.1 (Composite trapezium rule) 

\[ \int_a^b f(x)\,dx \approx h \left[ \tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{m-1}) + \tfrac{1}{2} f(x_m) \right]. \tag{7.15} \]


1 The historical roots of composite formulae may be traced back to the work of Ke- 
pler cited in the footnote to Simpson’s method earlier on in this chapter, although 
the idea of computing volumes of two- and three-dimensional geometrical objects 
by subdivision was already present in the work of Archimedes of Syracuse (287 BC, 
Syracuse (now in Italy) — 212 BC, Syracuse (now in Italy)). Archimedes’ long-lost 
book known as the Palimpsest, containing his geometrical studies, resurfaced at an 
auction at Christie’s of New York in 1998 and is now in the care of the Walters Art 
Gallery in Baltimore, Maryland, USA: http://www.thewalters.org/archimedes/frame.html 
The error in the composite trapezium rule can be estimated by using 
the error bound (7.6) for the trapezium rule on each individual subinterval 
$[x_{i-1}, x_i]$, $i = 1, 2, \ldots, m$. For this purpose, let us define 

\[ \mathcal{E}_1(f) = \int_a^b f(x)\,dx - h \left[ \tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{m-1}) + \tfrac{1}{2} f(x_m) \right] = \sum_{i=1}^{m} \left[ \int_{x_{i-1}}^{x_i} f(x)\,dx - \tfrac{1}{2} h \left[ f(x_{i-1}) + f(x_i) \right] \right]. \]

Applying (7.6) to each of the terms under the summation sign we obtain 

\[ |\mathcal{E}_1(f)| \le \frac{1}{12} h^3 \sum_{i=1}^{m} \left( \max_{\zeta \in [x_{i-1}, x_i]} |f''(\zeta)| \right) \le \frac{(b - a)^3}{12 m^2}\, M_2, \tag{7.16} \]

where $M_2 = \max_{\zeta \in [a,b]} |f''(\zeta)|$. 
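
A minimal Python sketch of the composite trapezium rule (7.15), together with a numerical check of the bound (7.16), is given below. The test integrand $e^x$ on $[0, 1]$ and the function name `composite_trapezium` are assumptions made for illustration.

```python
import math

def composite_trapezium(f, a, b, m):
    """Composite trapezium rule (7.15) with m equal subintervals."""
    h = (b - a) / m
    interior = sum(f(a + i * h) for i in range(1, m))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

a, b = 0.0, 1.0
exact = math.e - 1.0          # integral of e^x over [0, 1]
M2 = math.e                   # max of |f''| = e^x on [0, 1]

results = []
for m in (2, 4, 8, 16, 32):
    error = abs(exact - composite_trapezium(math.exp, a, b, m))
    bound = (b - a) ** 3 / (12 * m ** 2) * M2     # right-hand side of (7.16)
    results.append((m, error, bound))
    print(f"m = {m:2d}   error = {error:.3e}   bound = {bound:.3e}")
```

Each doubling of $m$ reduces the observed error by a factor close to 4, as the factor $1/m^2$ in (7.16) suggests.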

For Simpson's rule, let us suppose that the interval $[a, b]$ has been 
divided into $2m$ intervals by the points $x_i = a + ih$, $i = 0, 1, \ldots, 2m$, 
with $m \ge 2$ and 

\[ h = \frac{b - a}{2m}, \]

and let us apply Simpson's rule on each of the intervals $[x_{2i-2}, x_{2i}]$, 
$i = 1, 2, \ldots, m$, giving 

\[ \int_a^b f(x)\,dx = \sum_{i=1}^{m} \int_{x_{2i-2}}^{x_{2i}} f(x)\,dx \approx \sum_{i=1}^{m} \frac{h}{3} \left[ f(x_{2i-2}) + 4 f(x_{2i-1}) + f(x_{2i}) \right]. \]

This leads to the following definition. 


Definition 7.2 (Composite Simpson rule) 

\[ \int_a^b f(x)\,dx \approx \frac{h}{3} \left[ f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + \cdots + 2 f(x_{2m-2}) + 4 f(x_{2m-1}) + f(x_{2m}) \right]. \tag{7.17} \]


A schematic view of the pattern in which the coefficients 1, 4 and 2 
appear in the composite Simpson rule is shown in Figure 7.1. 

[Figure 7.1: the coefficients 1 4 2 4 2 4 2 4 2 4 2 4 1 marked at the 13 equally spaced points of the interval.]


Fig. 7.1. Quadrature weights for the composite Simpson rule: the integers 
1,4,2,4,...,1, when multiplied by h/3, where h = (b — a)/2m, provide the 
quadrature weights. This figure corresponds to taking m = 6. 


In order to estimate the error in the composite Simpson rule, we proceed 
in the same way as for the composite trapezium rule. Let us define 

\[ \mathcal{E}_2(f) = \int_a^b f(x)\,dx - \sum_{i=1}^{m} \frac{h}{3} \left[ f(x_{2i-2}) + 4 f(x_{2i-1}) + f(x_{2i}) \right] = \sum_{i=1}^{m} \left\{ \int_{x_{2i-2}}^{x_{2i}} f(x)\,dx - \frac{h}{3} \left[ f(x_{2i-2}) + 4 f(x_{2i-1}) + f(x_{2i}) \right] \right\}. \]

Applying (7.10) to each individual term in the sum and recalling that 
$b - a = 2mh$ we obtain the following error bound: 

\[ |\mathcal{E}_2(f)| \le \frac{(b - a)^5}{2880\, m^4}\, M_4, \tag{7.18} \]

where $M_4 = \max_{\zeta \in [a,b]} |f^{\mathrm{iv}}(\zeta)|$. 

The composite rules (7.15) and (7.17) provide greater accuracy than 
the basic formulae considered in Section 7.2; this is clearly seen by com- 
paring the error bounds (7.16) and (7.18) for the two composite rules 
with (7.6) and (7.8), the error estimates for the basic trapezium rule 
and Simpson rule respectively. The inequalities (7.16) and (7.18) indi- 
cate that, as long as the function f is sufficiently smooth, the errors in 
the composite rules can be made arbitrarily small by choosing a suffi- 
ciently large number of subintervals. 
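
The composite Simpson rule (7.17) and the bound (7.18) can be explored with the following Python sketch. As before, the integrand $e^x$ on $[0, 1]$ and the name `composite_simpson` are illustrative assumptions rather than part of the text.

```python
import math

def composite_simpson(f, a, b, m):
    """Composite Simpson rule (7.17) with 2m subintervals, h = (b - a)/(2m)."""
    h = (b - a) / (2 * m)
    total = f(a) + f(b)
    for i in range(1, 2 * m):
        # Odd-numbered points carry coefficient 4, even-numbered interior points 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return h * total / 3

a, b = 0.0, 1.0
exact = math.e - 1.0          # integral of e^x over [0, 1]
M4 = math.e                   # max of the fourth derivative of e^x on [0, 1]

results = []
for m in (2, 4, 8, 16):
    error = abs(exact - composite_simpson(math.exp, a, b, m))
    bound = (b - a) ** 5 / (2880 * m ** 4) * M4   # right-hand side of (7.18)
    results.append((m, error, bound))
    print(f"m = {m:2d}   error = {error:.3e}   bound = {bound:.3e}")
```

Doubling $m$ reduces the observed error by a factor close to 16, compared with a factor close to 4 for the composite trapezium rule.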


7.6 The Euler—Maclaurin expansion 
We have seen in (7.16) that the error in the composite trapezium rule is 
bounded by a term involving $1/m^2$, where $m$ is the number of subdivisions 
of the interval $[a, b]$; the Euler¹–Maclaurin² expansion expresses 
this error as a series in powers of $1/m^2$, and makes it possible to improve 
accuracy by extrapolation methods. 

We first define a sequence of polynomials. 


Definition 7.3 Consider the sequence of polynomials $q_r$, $r = 1, 2, \ldots$, 
defined by their properties, as follows: 


(i) $q_r$ is a polynomial of degree $r$; 
(ii) for each positive integer $r$, $q'_{r+1} = q_r$; 
(iii) $q_r$ is an odd function if $r$ is odd, and an even function if $r$ is even; 
(iv) if $r > 1$ is odd, then $q_r(-1) = 0$ and $q_r(1) = 0$; 
(v) $q_1(t) = -t$. 


Using these conditions it is easy to construct the polynomials $q_r$ in 
succession. From (v) and (ii) we get 

\[ q_2(t) = -\tfrac{1}{2} t^2 + A_2, \qquad q_3(t) = -\tfrac{1}{6} t^3 + A_2 t + A_3, \]

where $A_2$ and $A_3$ are constants. From (iii) we see that $A_3 = 0$; then, 
from (iv) it follows that $A_2 = \frac{1}{6}$. Hence, 

\[ q_2(t) = -\tfrac{1}{2} t^2 + \tfrac{1}{6}, \qquad q_3(t) = -\tfrac{1}{6} t^3 + \tfrac{1}{6} t. \]

We can then go on to construct $q_4$ and $q_5$, and so on. 
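
The construction just described is mechanical, and may be sketched in Python with exact rational coefficients. In this illustration (all names are ours) a polynomial is stored as a list of coefficients of $1, t, t^2, \ldots$; for even index the constant of integration is chosen so that the next, odd, polynomial vanishes at $t = 1$, as required by (iv).

```python
from fractions import Fraction

def antiderivative(p):
    """Antiderivative with zero constant term; p lists coefficients of 1, t, t^2, ..."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(p)]

def evaluate(p, t):
    return sum(c * t ** k for k, c in enumerate(p))

def q_polynomials(rmax):
    """Return {r: coefficients of q_r} for r = 1, ..., rmax (Definition 7.3)."""
    q = {1: [Fraction(0), Fraction(-1)]}          # property (v): q_1(t) = -t
    for r in range(1, rmax):
        nxt = antiderivative(q[r])                # property (ii), up to a constant
        if (r + 1) % 2 == 0:
            # Even index: fix the constant so that q_{r+2}(1) = 0, property (iv).
            nxt[0] = -evaluate(antiderivative(nxt), Fraction(1))
        # Odd index: the constant is 0 because q_{r+1} must be odd, property (iii).
        q[r + 1] = nxt
    return q

q = q_polynomials(6)
print(q[2])                                       # coefficients of -t^2/2 + 1/6
print(q[3])                                       # coefficients of -t^3/6 + t/6
print([evaluate(q[2 * r], Fraction(1)) / 2 ** (2 * r) for r in (1, 2, 3)])
```

The last line prints the quantities $q_{2r}(1)/2^{2r}$ for $r = 1, 2, 3$, which appear as coefficients in the expansion derived below.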


1 Leonhard Euler (15 April 1707, Basel, Switzerland — 18 September 1783, St Pe- 
tersburg, Russia) was the most prolific mathematical writer of all times, who made 
fundamental contributions to many branches of mathematics despite being totally 
blind for the last third of his life. Euler and his wife Katharina had 13 children: 
he claimed to have made his greatest discoveries while he was holding a baby in 
his arms and the other children were playing around his feet. Euler studied the 
calculus of variations, differential geometry, number theory, differential equations, 
continuum mechanics, astronomy, lunar theory, the three-body problem, elasticity, 
acoustics, the wave theory of light, hydraulics, and music. In his Theory of the 
Motions of Rigid Bodies published in 1765 he laid the foundation of analytical 
mechanics. Euler integrated Leibniz’s differential calculus and Newton’s method 
of fluxions into mathematical analysis. We owe him the concepts of beta and 
gamma functions and the notion of integrating factor for differential equations; 
he is responsible for the notation $e$ for the base of natural logarithm, $f(x)$ for a 
function, $\pi$ for pi, $\sum$ for summation, $i$ for the square root of $-1$, and $\Delta y$ and $\Delta^2 y$ 
for the first and second finite differences. 

2 Colin Maclaurin (February 1698, Kilmodan, Argyllshire, Scotland — 14 June 1746, 
Edinburgh, Scotland) became a student at the University of Glasgow at the age 
of 11 and completed his studies at the age of 14. In 1719, at the age of 21, he 
became Fellow of the Royal Society. His major work of 763 pages in two volumes, 
entitled A Treatise of Fluxions, was the first systematic exposition of Newton’s 
ideas. Notable is Maclaurin’s work on elliptic integrals, maxima and minima, and 
the attraction of ellipsoids. 
Theorem 7.3 Suppose that the function $g$ is defined and continuous on 
the interval $[-1, 1]$ and has a continuous derivative of order $2k$ over this 
interval. Then, 

\[ \int_{-1}^{1} g(t)\,dt - [g(-1) + g(1)] = \int_{-1}^{1} (-t)\, g'(t)\,dt = \sum_{r=1}^{k} q_{2r}(1) \left[ g^{(2r-1)}(1) - g^{(2r-1)}(-1) \right] - \int_{-1}^{1} q_{2k}(t)\, g^{(2k)}(t)\,dt. \tag{7.19} \]

Proof We observe that $\int_{-1}^{1} g(t)\,dt - [g(-1) + g(1)]$ is the error in the 
approximation of $\int_{-1}^{1} g(t)\,dt$ by the trapezium rule. Integration by parts 
gives 

\[ \int_{-1}^{1} (-t)\, g'(t)\,dt = -[g(1) + g(-1)] + \int_{-1}^{1} g(t)\,dt, \]

which establishes the first equality in (7.19). By repeated integration by 
parts in the other direction, and using the fact that $q_1(t) = -t$, we then 
have 

\[ \int_{-1}^{1} (-t)\, g'(t)\,dt = q_2(1) g'(1) - q_2(-1) g'(-1) - \int_{-1}^{1} q_2(t)\, g''(t)\,dt \]
\[ = q_2(1) g'(1) - q_2(-1) g'(-1) - \left[ q_3(1) g''(1) - q_3(-1) g''(-1) \right] + \int_{-1}^{1} q_3(t)\, g'''(t)\,dt = \cdots. \]

The required result follows from properties (iii) and (iv) of the $q_r$. 


Theorem 7.4 (Euler–Maclaurin expansion) Suppose that the real-valued 
function $f$ is defined and continuous on the interval $[a, b]$ and 
has a continuous derivative of order $2k$ on this interval. Consider the 
subdivision of $[a, b]$ into $m \ge 1$ closed intervals $[x_{i-1}, x_i]$, $i = 1, \ldots, m$, 
where $x_i = a + ih$, $i = 0, 1, \ldots, m$, and $h = (b - a)/m$. Writing $T(m)$ for 
the result of approximating the integral $I = \int_a^b f(x)\,dx$ by the composite 
trapezium rule with the $m$ subintervals $[x_{i-1}, x_i]$, $i = 1, \ldots, m$, 

\[ I - T(m) = \sum_{r=1}^{k} c_r h^{2r} \left[ f^{(2r-1)}(b) - f^{(2r-1)}(a) \right] - \left( \frac{h}{2} \right)^{2k} \sum_{i=1}^{m} \int_{x_{i-1}}^{x_i} q_{2k}(t)\, f^{(2k)}(x)\,dx, \tag{7.20} \]

where $t = t(x) = -1 + \frac{2}{h}(x - x_{i-1})$ for $x \in [x_{i-1}, x_i]$, $i = 1, \ldots, m$, and 
$c_r = q_{2r}(1)/2^{2r}$, $r = 1, \ldots, k$. 


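Proof We express the integral as a sum over the $m$ subintervals $[x_{i-1}, x_i]$, 
$i = 1, \ldots, m$, as in (7.13). In the interval $[x_{i-1}, x_i]$ we change the variable 
by writing $x = x_{i-1} + h(t + 1)/2$, so that 

\[ \int_{x_{i-1}}^{x_i} f(x)\,dx = \frac{h}{2} \int_{-1}^{1} g(t)\,dt, \]

where $f(x) = g(t)$. According to Theorem 7.3, then, 

\[ \int_{x_{i-1}}^{x_i} f(x)\,dx - \frac{h}{2} \left[ f(x_{i-1}) + f(x_i) \right] = \frac{h}{2} \left\{ \int_{-1}^{1} g(t)\,dt - [g(-1) + g(1)] \right\} \]
\[ = \frac{h}{2} \left\{ \sum_{r=1}^{k} q_{2r}(1) \left[ g^{(2r-1)}(1) - g^{(2r-1)}(-1) \right] - \int_{-1}^{1} q_{2k}(t)\, g^{(2k)}(t)\,dt \right\}. \]

On noting that $g^{(\ell)}(t) = (h/2)^{\ell} f^{(\ell)}(x)$, $\ell = 1, 2, \ldots, 2k$, $dt = (2/h)\,dx$, 
summation over all the subintervals $[x_{i-1}, x_i]$, for $i = 1, \ldots, m$, gives the 
required result. The important point is the symmetry of the polynomials 
$q_r$, which ensures that $q_{2r}(1) = q_{2r}(-1)$, so that all the derivatives of $f$ 
at the internal points $x_i$ cancel in the course of summation, leaving only 
the derivatives at $a$ and $b$. 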


Remark 7.1 By successively computing the polynomials $q_r(t)$, we can 
determine the values of $c_r = q_{2r}(1)/2^{2r}$, $r = 1, 2, 3, \ldots$. For example, 

\[ c_1 = -\frac{1}{12}, \quad c_2 = \frac{1}{720}, \quad c_3 = -\frac{1}{30240}, \quad c_4 = \frac{1}{1209600}, \quad c_5 = -\frac{1}{47900160}, \quad \ldots \]

It can be shown that $c_r = -B_{2r}/(2r)!$ for all $r = 1, 2, 3, \ldots$, where $B_{2r}$ are 
the Bernoulli numbers¹ with even index, which can be determined from 


1 Jacob Bernoulli the elder (27 December 1654, Basel, Switzerland — 16 August 
1705, Basel, Switzerland) was one of the first mathematicians to recognise the 
significance of the work of Newton and Leibniz on differential and integral calcu- 
lus. Bernoulli contributed to the theory of infinite series, mechanics, calculus of 
variations, mechanics, and is also known in probability theory for his Law of Large 
Numbers. 
the Taylor series expansion 

\[ \frac{x}{2} \coth\!\left( \frac{x}{2} \right) = \sum_{r=0}^{\infty} \frac{B_{2r}}{(2r)!}\, x^{2r}. \]

Easier still, typing c[6]=-bernoulli(12)/12!; at the Maple command 
line gives $c_6 = \frac{691}{1307674368000}$; $c_7, c_8, \ldots$ can be found in the same way. 


An interesting consequence of Theorem 7.4 concerns the numerical 
integration of smooth periodic functions. Suppose that $f$ is a continuous 
function defined on $(-\infty, \infty)$ such that all derivatives of $f$, up to and 
including order $2k$, are defined and continuous on $(-\infty, \infty)$, and $f$ is 
periodic on $(-\infty, \infty)$ with period $b - a$; i.e., $f(x + b - a) - f(x) = 0$ 
for all $x \in \mathbb{R}$. Hence, by successive differentiation of this equality and 
taking $x = a$ we deduce that, in particular, 

\[ f^{(2r-1)}(b) - f^{(2r-1)}(a) = 0 \qquad \text{for } r = 1, 2, \ldots, k. \]

Therefore, according to (7.20), we have that 

\[ I - T(m) = O(h^{2k}). \]

The fact that for $k \gg 1$ this integration error is much smaller than the 
$O(h^2)$ error that will be observed in the case of a nonperiodic function 
indicates that the composite trapezium rule is particularly well suited 
for the numerical integration of smooth periodic functions. 
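
This contrast is easy to observe numerically. The Python sketch below compares the composite trapezium rule on the periodic integrand $e^{\cos x}$ over one period $[0, 2\pi]$ with the nonperiodic integrand $e^x$ on $[0, 1]$. Both test functions are our own choice; the reference value $2\pi I_0(1)$ for the first integral is evaluated from the power series of the modified Bessel function $I_0$, and the helper names are hypothetical.

```python
import math

def composite_trapezium(f, a, b, m):
    """Composite trapezium rule with m equal subintervals."""
    h = (b - a) / m
    interior = sum(f(a + i * h) for i in range(1, m))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

def bessel_i0(x, terms=30):
    # Power series I_0(x) = sum over k of (x^2 / 4)^k / (k!)^2.
    return sum((x * x / 4) ** k / math.factorial(k) ** 2 for k in range(terms))

def periodic(x):
    return math.exp(math.cos(x))          # smooth, with period 2*pi

exact_periodic = 2 * math.pi * bessel_i0(1.0)
exact_nonperiodic = math.e - 1.0          # integral of e^x over [0, 1]

for m in (4, 8, 16):
    err_p = abs(exact_periodic - composite_trapezium(periodic, 0.0, 2 * math.pi, m))
    err_n = abs(exact_nonperiodic - composite_trapezium(math.exp, 0.0, 1.0, m))
    print(f"m = {m:2d}   periodic: {err_p:.2e}   nonperiodic: {err_n:.2e}")

err_periodic_16 = abs(exact_periodic - composite_trapezium(periodic, 0.0, 2 * math.pi, 16))
err_nonperiodic_16 = abs(exact_nonperiodic - composite_trapezium(math.exp, 0.0, 1.0, 16))
```

With only 16 subintervals the periodic example is already accurate to roughly rounding error, whereas the nonperiodic one still shows the $O(h^2)$ behaviour.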

A second application of the Euler—Maclaurin expansion concerns ex- 
trapolation methods. This subject will be discussed in the next section. 


7.7 Extrapolation methods 


In general the calculation of the higher derivatives involved in the
Euler–Maclaurin expansion (7.20) is not possible. However, the existence of the
expansion allows us to eliminate successive terms by repeated calculation
of the trapezium rule approximation.

For example, the case $k = 2$ of (7.20) may be written in the form

\[ \int_a^b f(x)\,dx - T(m) = C_1 h^2 + O(m^{-4}), \]

where $C_1 = c_1[f'(b) - f'(a)]$ and $h = (b-a)/m$. This also means that

\[ \int_a^b f(x)\,dx - T(2m) = C_1 (h/2)^2 + O(m^{-4}). \]

We can eliminate the term in $h^2$ from these two equalities, giving

\[ \int_a^b f(x)\,dx = \frac{4T(2m) - T(m)}{3} + O(h^4). \]
The same elimination process could be used for any two values of $m$,
from the calculation of $T(m_1)$ and $T(m_2)$; the advantage of using $m$ and
$2m$ is that in the computation of $T(2m)$ half the required values of $f(x_i)$
are already known from $T(m)$, and we do not have to calculate them
again. This process of eliminating the term in $h^2$ from the expansion
of the error is known as Richardson extrapolation¹ or $h^2$
extrapolation. It is easy to extend the process to higher-order terms. For
example,


\[ \int_a^b f(x)\,dx - T(m) = C_1 h^2 + C_2 h^4 + C_3 h^6 + O(h^8). \]

Hence

\[ \int_a^b f(x)\,dx - \frac{4T(2m) - T(m)}{3} = -\tfrac{1}{4} C_2 h^4 - \tfrac{5}{16} C_3 h^6 + O(h^8), \]

which leads to

\[ \int_a^b f(x)\,dx - \frac{16 T_1(2m) - T_1(m)}{15} = O(h^6), \]

where

\[ T_1(m) = \frac{4T(2m) - T(m)}{3}. \]

Therefore,

\[ T_2(m) = \frac{16 T_1(2m) - T_1(m)}{15} \]

approximates the integral $\int_a^b f(x)\,dx$ to accuracy $O(h^6)$. Adopting the
notational convention

\[ T_0(m) = T(m) \]

and proceeding recursively,


1 Lewis Fry Richardson (11 October 1881, Newcastle upon Tyne, Northumberland, 
England — 30 September 1953, Kilmun, Argyllshire, Scotland) studied mathemat- 
ics, physics, chemistry, botany and zoology at the Durham College of Science, 
and subsequently Natural Science at King’s College in Cambridge. He worked in 
the National Physical Laboratory and the Meteorological Office, and was the first 
to apply numerical mathematics, in particular the method of finite differences, 
to predicting the weather in Weather Prediction by Numerical Process (1922). 
The Richardson number, a quantity involving gradients of temperature and wind 
velocity is named after him. 
Table 7.2. Romberg table.

   m    T(m)     T_1(m)    T_2(m)    T_3(m)   T_4(m)
   4    T(4)     T_1(4)    T_2(4)    T_3(4)   T_4(4)
   8    T(8)     T_1(8)    T_2(8)    T_3(8)
  16    T(16)    T_1(16)   T_2(16)
  32    T(32)    T_1(32)
  64    T(64)

\[ T_k(m) = \frac{4^k T_{k-1}(2m) - T_{k-1}(m)}{4^k - 1}, \qquad k = 1, 2, 3, \ldots, \tag{7.21} \]


will approximate $\int_a^b f(x)\,dx$ to accuracy $O(h^{2k+2})$, provided of course
that $f^{(2k+2)}$ exists and is continuous on the closed interval $[a,b]$. This
extrapolation process is known as the Romberg¹ integration method.
The intermediate results in Romberg's method are often arranged in
the form of a table, known as the Romberg table. For example, if we
start with $m = 4$ subdivisions of the closed interval $[a,b]$, each of length
$h = (b-a)/4$, and proceed by doubling the number of subdivisions in
each step (and thereby halving the spacing $h$ between the quadrature
points from the previous step), then the associated Romberg table is
as shown in Table 7.2, where we took, successively, $m = 4, 8, 16, 32, 64$
subdivisions of the interval $[a,b]$ of length $h = (b-a)/m$ each. After
$T_0(4) = T(4), \ldots, T_0(64) = T(64)$ have been computed, we calculate
$T_1(4), \ldots, T_1(32)$ using (7.21) with $k = 1$, then we compute
$T_2(4), \ldots, T_2(16)$ using (7.21) with $k = 2$, then $T_3(4), T_3(8)$ using (7.21)
with $k = 3$, and finally $T_4(4)$ using (7.21) with $k = 4$. Provided that
the integrand is sufficiently smooth, the numbers in the $T(m)$ column
approximate the integral to within an error $O(h^2)$; the numbers in the
$T_1(m)$ column to within $O(h^4)$, those in the $T_2(m)$ column to $O(h^6)$,
those in the $T_3(m)$ column to $O(h^8)$, and those in the $T_4(m)$ column to
within $O(h^{10})$.
¹ Werner Romberg, Emeritus Professor at the Institute of Applied Mathematics at
the University of Heidelberg in Germany. The extrapolation process was proposed
in his paper Vereinfachte numerische Integration [German], Norske Vid. Selsk.
Forh., Trondheim 28, 30–36, 1955.
An example is shown in Table 7.3. This gives the results of calculating
the integral

\[ \int_0^1 \frac{\mathrm{e}^{-2x}}{1+4x}\,dx \]

by Romberg's method; first the trapezium rule is used successively with
$m = 4, 8, 16, 32$ and $64$ equal subdivisions of the interval $[0,1]$ of length
$h = (b-a)/m$ each. There are then four stages of extrapolation: Stage 1
involves computing $T_1(m)$ for $m = 4, 8, 16, 32$; Stage 2 computes $T_2(m)$
for $m = 4, 8, 16$; Stage 3 calculates $T_3(m)$ for $m = 4, 8$; and Stage 4
then computes $T_4(m)$ for $m = 4$. Not only does the extrapolation give
an accurate result, but the consistency of the numerical values in the
last two columns gives a good deal of confidence in quoting the result
0.220458 correct to six decimal digits. Note that none of the individual
composite trapezium rule calculations in the $T(m)$ column gives a result
correct to more than three decimal digits, not even $T(64)$, which uses
64 equal subdivisions of $[0,1]$.


Table 7.3. Romberg table for the calculation of $\int_0^1 (\mathrm{e}^{-2x}/(1+4x))\,dx$.

   m    T(m)       T_1(m)     T_2(m)     T_3(m)     T_4(m)
   4    0.248802   0.221038   0.220470   0.220458   0.220458
   8    0.227979   0.220505   0.220459   0.220458
  16    0.222374   0.220461   0.220458
  32    0.220940   0.220458
  64    0.220579
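The construction of such a table from the recursion (7.21) can be sketched in a few lines. This is an illustrative implementation only: the function names are hypothetical, and the integrand $\mathrm{e}^{-2x}/(1+4x)$ is an assumption inferred from the tabulated values (it reproduces the entry $T(4) = 0.248802$).

```python
import math


def trapezium(f, a, b, m):
    """Composite trapezium rule T(m) with m equal subintervals of [a, b]."""
    h = (b - a) / m
    interior = sum(f(a + i * h) for i in range(1, m))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def romberg_table(f, a, b, m0=4, stages=4):
    """Return columns [T_0, T_1, ..., T_stages]; column k holds
    T_k(m0), T_k(2*m0), ... as far as the recursion (7.21) allows."""
    columns = [[trapezium(f, a, b, m0 * 2**j) for j in range(stages + 1)]]
    for k in range(1, stages + 1):
        prev = columns[-1]
        factor = 4**k
        # T_k(m) = (4^k T_{k-1}(2m) - T_{k-1}(m)) / (4^k - 1)
        columns.append([(factor * prev[j + 1] - prev[j]) / (factor - 1)
                        for j in range(len(prev) - 1)])
    return columns


def integrand(x):
    # Assumed integrand, inferred from the tabulated trapezium values.
    return math.exp(-2.0 * x) / (1.0 + 4.0 * x)


table = romberg_table(integrand, 0.0, 1.0)
for k, column in enumerate(table):
    print(k, [round(value, 6) for value in column])
```

Each column is stored top to bottom, so `table[k][0]` corresponds to $T_k(4)$. For simplicity the sketch recomputes every function value in each trapezium sum; a practical code would reuse the values already known from $T(m)$ when forming $T(2m)$, as noted in the text.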


Romberg integration can be relied upon only if the integrand
$f$ satisfies the hypotheses of the Euler–Maclaurin Theorem. As an
illustration of this, Table 7.4 shows the result of the same calculation, but
for the integral

\[ \int_0^1 x^{1/3}\,dx. \]

The function $x \mapsto x^{1/3}$ is not differentiable at $x = 0$, so the required
conditions are not satisfied for any extrapolation. The numerical results
bear this out; they are quite close to the correct value, 3/4, but the
behaviour of the extrapolation does not give any confidence in the accuracy
of the result. In fact the extrapolation has not given much improvement
on $T(64)$. The calculation of integrals involving this sort of singularity
requires special methods which we shall not discuss here.

We have reached the end of this chapter, but do not despair: the story 
about numerical integration rules will continue. In Chapter 10 we shall 
discuss a class of quadrature formulae, generally referred to as Gaussian 
quadrature rules, which are distinct from the Newton—Cotes formulae 
considered here. Before doing so, however, in Chapters 8 and 9 we make 
a brief excursion into the realm of approximation theory. 


Table 7.4. Romberg table for the calculation of $\int_0^1 x^{1/3}\,dx$.

   m    T(m)       T_1(m)     T_2(m)     T_3(m)     T_4(m)
   4    0.708055   0.741448   0.746950   0.748819   0.749534
   8    0.733100   0.746606   0.748790   0.749531
  16    0.743230   0.748653   0.749520
  32    0.747297   0.749465
  64    0.748923


7.8 Notes 


The material presented in this chapter is classical. For further details on 
the theory and practice of numerical integration, we refer to the following 
texts: 


» PHILIP J. DAVIS AND PHILIP RABINOWITZ, Methods of Numerical
Integration, Second Edition, Computer Science and Applied Mathe- 
matics, Academic Press, Orlando, FL, 1984; 

» VLADIMIR IVANOVICH KRYLOV, Approximate Calculation of Inte- 
grals, translated from Russian by Arthur H. Stroud, ACM Monograph 
Series, Macmillan, New York, 1962; 

» HERMANN ENGELS, Numerical Quadrature and Cubature, Computa- 
tional Mathematics and Applications, Academic Press, London, 1980. 


The first of these is a standard text and contains a huge bibliography 
of more than 1500 entries. Concerning the implementation of numerical 
integration rules into mathematical software, the reader is referred to 


» ARNOLD R. KROMMER AND CHRISTOPH W. UEBERHUBER, Compu- 
tational Integration, SIAM, Philadelphia, 1998. 
It includes a comprehensive overview of computational integration
techniques based on both numerical and symbolical methods, and an exposi-
tion of some more recent number-theoretical, pseudorandom and lattice 
algorithms; these topics are beyond the scope of the present text. 


Exercises 


7.1 With the usual notation for the Newton–Cotes quadrature
formula and using the equally spaced quadrature points $x_k = a + kh$
for $k = 0, 1, \ldots, n$ and $n \ge 1$, show that $w_k = w_{n-k}$ for
$k = 0, 1, \ldots, n$.

7.2 By considering the polynomial $[x - (a+b)/2]^{n+1}$, $n \ge 1$, and the
result of Exercise 1, or otherwise, show that the Newton–Cotes
formula using $n+1$ points $x_k$, $k = 0, 1, \ldots, n$, is exact for all
polynomials of degree $n+1$ whenever $n$ is even.

7.3 A quadrature formula on the interval $[-1,1]$ uses the quadrature
points $x_0 = -\alpha$ and $x_1 = \alpha$, where $0 < \alpha \le 1$:

\[ \int_{-1}^{1} f(x)\,dx \approx w_0 f(-\alpha) + w_1 f(\alpha). \]

The formula is required to be exact whenever $f$ is a polynomial
of degree 1. Show that $w_0 = w_1 = 1$, independent of the value
of $\alpha$. Show also that there is one particular value of $\alpha$ for which
the formula is exact also for all polynomials of degree 2. Find
this $\alpha$, and show that, for this value, the formula is also exact
for all polynomials of degree 3.

7.4 The Newton–Cotes formula with $n = 3$ on the interval $[-1,1]$ is

\[ \int_{-1}^{1} f(x)\,dx \approx w_0 f(-1) + w_1 f(-1/3) + w_2 f(1/3) + w_3 f(1). \]

Using the fact that this formula is to be exact for all polynomials
of degree 3, or otherwise, show that

\[ 2w_0 + 2w_1 = 2, \qquad 2w_0 + \tfrac{2}{9} w_1 = \tfrac{2}{3}, \]

and hence find the values of the weights $w_0$, $w_1$, $w_2$ and $w_3$.
7.5 For each of the functions $1, x, x^2, \ldots, x^6$, find the difference
between $\int_{-1}^{1} f(x)\,dx$ and (i) Simpson's rule, (ii) the formula derived
in Exercise 4.
Deduce that for every polynomial of degree 5 formula (ii) is
more accurate than formula (i). Find a polynomial of degree 6
for which formula (i) is more accurate than formula (ii).

7.6 Write down the errors in the approximation of

\[ \int_0^1 x^4\,dx \quad \text{and} \quad \int_0^1 x^5\,dx \]

by the trapezium rule and Simpson's rule. Hence find the value
of the constant $C$ for which the trapezium rule gives the correct
result for the calculation of

\[ \int_0^1 (x^5 - Cx^4)\,dx, \]

and show that the trapezium rule gives a more accurate result
than Simpson's rule when $\frac{15}{14} < C < \frac{85}{74}$.

7.7 Determine the values of $c_j$, $j = -1, 0, 1, 2$, such that the
quadrature rule

\[ Q(f) = c_{-1} f(-1) + c_0 f(0) + c_1 f(1) + c_2 f(2) \]

gives the correct value for the integral

\[ \int_0^1 f(x)\,dx \]

when $f$ is any polynomial of degree 3. Show that, with these
values of the weights $c_j$, and under appropriate conditions on
the function $f$,

\[ \left| \int_0^1 f(x)\,dx - Q(f) \right| \le \tfrac{11}{720} M_4. \]

Give suitable conditions for the validity of this bound, and a
definition of the quantity $M_4$.

7.8 Writing $T(m)$ for the composite trapezium rule defined in (7.15)
and $S(2m)$ for the composite Simpson's rule defined in (7.17),
show that

\[ S(2m) = \tfrac{4}{3} T(2m) - \tfrac{1}{3} T(m). \]


7.9 Suppose that the function $f$ has a continuous fourth derivative
on the interval $[a,b]$, and that $T(m)$ denotes the composite
trapezium rule approximation to $\int_a^b f(x)\,dx$, using $m$
subintervals. Show that

\[ \frac{T(m) - T(2m)}{T(2m) - T(4m)} \to 4 \quad \text{as } m \to \infty. \]

Using the information in Table 7.3 evaluate this expression for
$m = 4, 8, 16$.

7.10 With the same notation as in Exercise 9, suppose that the fourth
derivative of $f$ is not continuous on $[a,b]$, but that

\[ \int_a^b f(x)\,dx - T(m) = A/m^{\alpha} + E(m), \]

where $\alpha > 0$ and $A$ are constants and $\lim_{m\to\infty} m^{\alpha} E(m) = 0$.
Determine

\[ \lim_{m\to\infty} \frac{T(m) - T(2m)}{T(2m) - T(4m)}. \]

Suggest a value of $\alpha$ which is consistent with the values of $T(m)$
given in Table 7.4.

7.11 The function $f$ has a continuous fourth derivative on the
interval $[-1,1]$. Construct the Hermite interpolation polynomial
of degree 3 for $f$ using the interpolation points $x_0 = -1$ and
$x_1 = 1$. Deduce that


7.12 Construct the polynomials $q_4$, $q_5$, $q_6$ and $q_7$ given by Definition
7.3. Hence show that, in the notation of Theorem 7.4,

\[ c_1 = -1/12, \quad c_2 = 1/720, \quad c_3 = -1/30240. \]


7.13 Using the relations

\[ 2\sin\tfrac{1}{2}x \sum_{j=1}^{m} \sin jx = \cos\tfrac{1}{2}x - \cos\left(m + \tfrac{1}{2}\right)x, \]
\[ 2\sin\tfrac{1}{2}x \sum_{j=1}^{m} \cos jx = \sin\left(m + \tfrac{1}{2}\right)x - \sin\tfrac{1}{2}x, \]

where $m$ is a positive integer, show that the composite trapezium
rule (7.15) with $m$ subintervals will give the exact result for
each of the integrals

\[ \int_{-\pi}^{\pi} \cos rx\,dx, \qquad \int_{-\pi}^{\pi} \sin rx\,dx, \]

for any integer value of $r$ which is not a multiple of $m$.
What values are given by the composite trapezium rule for
these integrals when $r = mk$ and $k$ is a positive integer?


8 


Polynomial approximation in the ∞-norm


8.1 Introduction 


In Chapter 6 we considered the problem of interpolating a function by 
polynomials of a certain degree. Here we shall discuss other types of 
approximation by polynomials, the overall objective being to find the 
polynomial of given degree $n$ which provides the 'best approximation'
from $\mathcal{P}_n$ to a given function in a sense that will be made precise below.


8.2 Normed linear spaces 


In order to be able to talk about ‘best approximation’ in a rigorous 
manner we need to recall from Chapter 2 the concept of norm; this will 
allow us to compare various approximations quantitatively and select the 
one which has the smallest approximation error. The definition given in 
Section 2.7 applies to a linear space consisting of functions in the same 
way as to the finite-dimensional linear spaces considered in Chapter 2. 


Definition 8.1 Suppose that $V$ is a linear space over the field $\mathbb{R}$ of
real numbers. A nonnegative function $\|\cdot\|$ defined on $V$ whose value at
$f \in V$ is denoted by $\|f\|$ is called a norm on $V$ if it satisfies the following
axioms:

❶ $\|f\| = 0$ if, and only if, $f = 0$ in $V$;

❷ $\|\lambda f\| = |\lambda|\,\|f\|$ for all $\lambda \in \mathbb{R}$, and all $f$ in $V$;

❸ $\|f + g\| \le \|f\| + \|g\|$ for all $f$ and $g$ in $V$ (the triangle inequality).

A linear space $V$, equipped with a norm, is called a normed linear
space.
Throughout this chapter $[a,b]$ will denote a nonempty, bounded and
closed interval of $\mathbb{R}$, and $(a,b)$ will signify a nonempty, bounded open
interval of $\mathbb{R}$.


Example 8.1 The set $C[a,b]$ of real-valued functions $f$, defined and
continuous on the interval $[a,b]$, is a normed linear space with norm

\[ \|f\|_\infty = \max_{x\in[a,b]} |f(x)|. \tag{8.1} \]

The norm $\|\cdot\|_\infty$ is called the $\infty$-norm or maximum norm; it can
be thought of as an analogue of the $\infty$-norm for vectors introduced in
Chapter 2. Thus, for the sake of notational simplicity, here we shall use
the same symbol $\|\cdot\|_\infty$ as in Chapter 2, tacitly assuming in what
follows that $\|f\|_\infty$ signifies the $\infty$-norm of a continuous function $f$, defined
on a bounded closed interval of the real line (rather than the $\infty$-norm
of an $n$-component vector). The choice of the interval $[a,b]$ over which
the norm is taken will always be clear from the context and will not be
explicitly highlighted in our notation. ◇


Example 8.2 Suppose that $w$ is a real-valued function, defined,
continuous, positive and integrable on the interval $(a,b)$. The set $C[a,b]$ of
real-valued functions $f$, defined and continuous on $[a,b]$, is a normed
linear space equipped with the norm

\[ \|f\|_2 = \left( \int_a^b w(x)\,|f(x)|^2\,dx \right)^{1/2}. \tag{8.2} \]

The norm $\|\cdot\|_2$ is called the 2-norm. The function $w$ is called a weight
function. The assumptions on $w$ allow for singular weight functions,
such as $w\colon x \in (0,1) \mapsto x^{-1/2}$, which is continuous, positive and
integrable on the open interval $(0,1)$, but is not continuous on the closed
interval $[0,1]$. The norm (8.2) can be thought of as an analogue of the
2-norm for vectors introduced in Chapter 2; thus, for the sake of
simplicity, we use the same notation, $\|\cdot\|_2$, as there. As for the $\infty$-norm, we
shall not explicitly indicate in our notation the interval over which the
norm is taken. The implied choice of interval $[a,b]$ and weight function
$w$ will be clear from the context. ◇


The next lemma provides a comparison of the $\infty$-norm with the
2-norm, defined by (8.1) and (8.2), respectively, on $C[a,b]$.

Lemma 8.1 (i) Suppose that the real-valued weight function $w$ is
defined, continuous, positive and integrable on the interval $(a,b)$. Then,
for any function $f \in C[a,b]$,

\[ \|f\|_2 \le W\,\|f\|_\infty, \quad \text{where } W = \left[ \int_a^b w(x)\,dx \right]^{1/2}. \]

(ii) Given any two positive numbers $\varepsilon$ (however small) and $M$
(however large), there exists a function $f \in C[a,b]$ such that

\[ \|f\|_2 < \varepsilon, \qquad \|f\|_\infty > M. \]

Proof The proof is left as an exercise (see Exercise 1).


The definitions (2.33) and (2.34) of the vector norms $\|\cdot\|_\infty$ and $\|\cdot\|_2$
on $\mathbb{R}^n$ imply that

\[ n^{-1/2}\,\|v\|_\infty \le \|v\|_2 \le n^{1/2}\,\|v\|_\infty \quad \forall v \in \mathbb{R}^n, \tag{8.3} \]

which means that, to all intents and purposes, these two norms are
interchangeable.¹ Lemma 8.1 indicates that a similar chain of inequalities
cannot possibly hold for the norms (8.1) and (8.2) on $C[a,b]$, and the
choice between them may therefore significantly influence the outcome
of the analysis.

Stimulated by the first axiom of norm, we shall think of $f \in C[a,b]$ as
being well approximated by a polynomial $p$ on $[a,b]$ if $\|f - p\|$ is small,
where $\|\cdot\|$ is either $\|\cdot\|_\infty$ or $\|\cdot\|_2$ defined, respectively, by (8.1) or
(8.2). In the light of Lemma 8.1, it should come as no surprise that
the mathematical tools for the analysis of smallness of $\|f - p\|_\infty$ are
quite different from those that ensure smallness of $\|f - p\|_2$. We have
therefore chosen to discuss these two matters separately: the present
chapter focuses on the $\infty$-norm (8.1), while Chapter 9 explores the use
of the 2-norm (8.2).

Despite the fundamental differences between the norms (8.1) and (8.2)
which we have alluded to above, there is a common underlying feature
which is independent of the choice of norm: if no limitation is imposed
on the degree of the approximating polynomial $p$, then the approximation
error $f - p$ can be made arbitrarily small in both norms. This is a central
result in the theory of polynomial approximation and is formulated in
the next theorem.

¹ The chain of inequalities (8.3) is, in fact, just a particular manifestation of the
following general result from linear algebra. Suppose that $V$ is a finite-dimensional
linear space and let $\|\cdot\|'$ and $\|\cdot\|''$ be two norms on $V$; then, there exist positive
real numbers $m$ and $M$ such that

\[ m\,\|v\|' \le \|v\|'' \le M\,\|v\|' \quad \forall v \in V. \]


Theorem 8.1 (Weierstrass Approximation Theorem¹) Suppose
that $f$ is a real-valued function, defined and continuous on a bounded
closed interval $[a,b]$ of the real line; then, given any $\varepsilon > 0$, there exists
a polynomial $p$ such that

\[ \|f - p\|_\infty \le \varepsilon. \]

Further, if $w$ is a real-valued function, defined, continuous, positive and
integrable on $(a,b)$, then an analogous result holds in the 2-norm over
the interval $[a,b]$ with weight function $w$.


This is an important theorem in classical analysis, and several proofs
are known. It is evidently sufficient to consider only the interval $[0,1]$;
a simple change of variable will then extend the proof to any bounded
closed interval $[a,b]$. For a real-valued function $f$, defined and continuous
on the interval $[0,1]$, Bernstein's proof uses the polynomial

\[ p_n(x) = \sum_{k=0}^{n} p_{nk}(x)\,f(k/n), \quad x \in [0,1], \]

where the Bernstein polynomials $p_{nk}(x)$ are defined by

\[ p_{nk}(x) = \binom{n}{k} x^k (1-x)^{n-k}, \quad x \in [0,1]. \]

It can then be shown that, for any $\varepsilon > 0$, there exists $n = n(\varepsilon)$ such that
$\|f - p_n\|_\infty < \varepsilon$. The second part of the theorem is a direct consequence
of this result, using part (i) of Lemma 8.1.
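The behaviour of the Bernstein approximation is easy to observe numerically. The sketch below is illustrative only: the test function $|x - \tfrac12|$ and the sampling grid are assumptions chosen for this example, and the maximum over a finite grid only estimates the $\infty$-norm from below.

```python
from math import comb


def bernstein_approximation(f, n):
    """Return p_n(x) = sum_k C(n, k) x^k (1 - x)^(n - k) f(k / n) on [0, 1]."""
    def p(x):
        return sum(comb(n, k) * x**k * (1.0 - x)**(n - k) * f(k / n)
                   for k in range(n + 1))
    return p


def sampled_max_error(f, p, samples=1001):
    """Largest |f - p| over an equally spaced grid on [0, 1]."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - p(x)) for x in grid)


def kink(x):
    # Continuous on [0, 1] but not differentiable at x = 1/2.
    return abs(x - 0.5)


errors = {n: sampled_max_error(kink, bernstein_approximation(kink, n))
          for n in (10, 40, 160)}
for n, err in errors.items():
    print(n, err)
```

The errors decrease as $n$ grows, in line with the theorem, but only slowly; this is one reason why the Bernstein construction is valued as a proof device rather than as a practical approximation method.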

The details of the proof are given in Exercise 12. For an alternative 
proof, the reader is referred to Theorem 6.3 in M.J.D. Powell, Approxi- 
mation Theory and Methods, Cambridge University Press, 1996. 


¹ Karl Theodor Wilhelm Weierstrass (31 October 1815, Ostenfelde, Westphalia,
Germany – 19 February 1897, Berlin, Germany) is frequently referred to as the
father of modern mathematical analysis. He made fundamental contributions to
the theory of series, functions of real variables, elliptic functions, converging
infinite products, the calculus of variations, and the theory of bilinear and quadratic
forms. Weierstrass' students included Cantor, Frobenius, Gegenbauer, Hölder,
Hurwitz, Killing, Klein, Kneser, Sofia Kovalevskaya, Lie, Mertens, Minkowski,
Mittag-Leffler, Schwarz and Stolz.
8.3 Best approximation in the ∞-norm


According to the Weierstrass Approximation Theorem any function $f$ in
$C[a,b]$ can be approximated arbitrarily well from the set of all
polynomials. Clearly, if instead of the set of all polynomials we restrict ourselves
to the set of polynomials $\mathcal{P}_n$ of degree $n$ or less, with $n$ fixed, then it
is no longer true that, for any $f \in C[a,b]$ and any $\varepsilon > 0$, there exists
$p_n \in \mathcal{P}_n$ such that

\[ \|f - p_n\|_\infty < \varepsilon. \]

Consider, for example, the function $x \mapsto \sin x$ defined on the interval
$[0,\pi]$ and fix $n = 0$; then $\|f - q\|_\infty \ge 1/2$ for any $q \in \mathcal{P}_0$, and therefore
there is no $q$ in $\mathcal{P}_0$ such that $\|f - q\|_\infty < 1/2$. A similar situation will
arise if $\mathcal{P}_0$ is replaced by $\mathcal{P}_n$ with the polynomial degree $n$ fixed.¹

It is therefore relevant to enquire just how well a given function $f$ in
$C[a,b]$ may be approximated by polynomials of a fixed degree $n \ge 0$.
This question leads us to the following approximation problem.


(A) Given that $f \in C[a,b]$ and $n \ge 0$, fixed, find $p_n \in \mathcal{P}_n$ such that

\[ \|f - p_n\|_\infty = \inf_{q\in\mathcal{P}_n} \|f - q\|_\infty; \]

such a polynomial $p_n$ is called a polynomial of best approximation
of degree $n$ to the function $f$ in the $\infty$-norm.


The next theorem establishes the existence of a polynomial of best
approximation, showing, in particular, that the infimum of $\|f - q\|_\infty$
over $q \in \mathcal{P}_n$ is attained. We shall consider the question of uniqueness of
the polynomial of best approximation later on, in Theorem 8.5.


Theorem 8.2 Given that $f \in C[a,b]$, there exists a polynomial $p_n \in \mathcal{P}_n$
such that $\|f - p_n\|_\infty = \min_{q\in\mathcal{P}_n} \|f - q\|_\infty$.


Proof Let us define the function $(c_0,\ldots,c_n) \in \mathbb{R}^{n+1} \mapsto E(c_0,\ldots,c_n)$ of
$n+1$ real variables by

\[ E(c_0,\ldots,c_n) = \|f - q_n\|_\infty, \quad \text{where } q_n(x) = c_0 + \cdots + c_n x^n. \]


¹ This is due to the fact that, for any fixed $n$, $\mathcal{P}_n$ is a closed subset of $C[a,b]$; i.e.,
if $f$ does not belong to $\mathcal{P}_n$, there exists $\varepsilon > 0$ such that

\[ \inf_{q\in\mathcal{P}_n} \|f - q\|_\infty > \varepsilon. \]

On the other hand, by the Weierstrass Theorem, the set of all polynomials is dense
in $C[a,b]$: any continuous function $f$ can be represented as a limit of a uniformly
convergent sequence of polynomials (of, in general, increasing degree) on $[a,b]$.
We shall first show that $E$ is continuous; this will imply that $E$ attains
its bounds on any bounded closed set in $\mathbb{R}^{n+1}$. We shall then construct
a nonempty bounded closed set $S \subset \mathbb{R}^{n+1}$ such that the lower bound of
$E$ on $S$ is the same as its lower bound over the whole of $\mathbb{R}^{n+1}$.

To show that $E$ is continuous at each point $(c_0,\ldots,c_n) \in \mathbb{R}^{n+1}$,
consider any $(\delta_0,\ldots,\delta_n) \in \mathbb{R}^{n+1}$ and define the polynomial $\eta_n \in \mathcal{P}_n$ by
$\eta_n(x) = \delta_0 + \cdots + \delta_n x^n$. We see from the triangle inequality that

\[ \begin{aligned}
E(c_0+\delta_0,\ldots,c_n+\delta_n) &= \|f - (q_n + \eta_n)\|_\infty \\
&\le \|f - q_n\|_\infty + \|\eta_n\|_\infty \\
&= E(c_0,\ldots,c_n) + \|\eta_n\|_\infty.
\end{aligned} \]

IA 


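Now, for any given positive number $\varepsilon$, choose $\delta = \varepsilon/(1 + \cdots + K^n)$,
where $K = \max\{|a|,|b|\}$. Consider any $(\delta_0,\ldots,\delta_n) \in \mathbb{R}^{n+1}$ such that
$|\delta_i| \le \delta$ for all $i = 0,\ldots,n$. Then,

\[ \begin{aligned}
E(c_0+\delta_0,\ldots,c_n+\delta_n) - E(c_0,\ldots,c_n) &\le \|\eta_n\|_\infty \\
&\le \max_{x\in[a,b]} \left( |\delta_0| + |\delta_1||x| + \cdots + |\delta_n||x|^n \right) \\
&\le \delta\,(1 + K + \cdots + K^n) \\
&= \varepsilon.
\end{aligned} \tag{8.4} \]

Similarly,

\[ \begin{aligned}
E(c_0,\ldots,c_n) = \|f - q_n\|_\infty &= \|f - (q_n + \eta_n) + \eta_n\|_\infty \\
&\le \|f - (q_n + \eta_n)\|_\infty + \|\eta_n\|_\infty \\
&\le E(c_0+\delta_0,\ldots,c_n+\delta_n) + \varepsilon,
\end{aligned} \]

and therefore

\[ E(c_0,\ldots,c_n) - E(c_0+\delta_0,\ldots,c_n+\delta_n) \le \varepsilon. \tag{8.5} \]

From (8.4) and (8.5) we deduce that

\[ |E(c_0+\delta_0,\ldots,c_n+\delta_n) - E(c_0,\ldots,c_n)| \le \varepsilon \]

for all $(\delta_0,\ldots,\delta_n) \in \mathbb{R}^{n+1}$ such that $|\delta_i| \le \delta$, $i = 0,\ldots,n$, where
$\delta = \varepsilon/(1 + \cdots + K^n)$ and $K = \max\{|a|,|b|\}$. Hence $E$ is continuous at
$(c_0,\ldots,c_n) \in \mathbb{R}^{n+1}$. Since $(c_0,\ldots,c_n)$ is an arbitrary point in $\mathbb{R}^{n+1}$, it
follows that $E$ is continuous on the whole of $\mathbb{R}^{n+1}$.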

Let us denote by $S$ the set of all points $(c_0,\ldots,c_n)$ in $\mathbb{R}^{n+1}$ such that
$E(c_0,\ldots,c_n) \le \|f\|_\infty + 1$. The set $S$ is evidently bounded and closed
in $\mathbb{R}^{n+1}$; further, $S$ is nonempty since $E(0,\ldots,0) = \|f\|_\infty < \|f\|_\infty + 1$,
so that $(0,\ldots,0) \in S$. Hence the continuous function $E$ attains its
lower bound over the set $S$; let us denote this lower bound by $d$ and let
$(c_0^*,\ldots,c_n^*)$ denote the point in $S$ where it is attained.
Since $(0,\ldots,0) \in S$, it follows that

\[ d = \min_{(c_0,\ldots,c_n)\in S} E(c_0,\ldots,c_n) \le E(0,\ldots,0) = \|f\|_\infty. \]

According to the definition of $S$,

\[ E(c_0,\ldots,c_n) > \|f\|_\infty + 1 \quad \forall (c_0,\ldots,c_n) \in \mathbb{R}^{n+1} \setminus S. \]

Hence, if $(c_0,\ldots,c_n) \notin S$, then $E(c_0,\ldots,c_n) > d + 1 > d$. Thus, the
lower bound $d$ of the function $E$ over the set $S$ is the same as the lower
bound of $E$ over all values of $(c_0,\ldots,c_n) \in \mathbb{R}^{n+1}$. The lower bound $d$ is
attained at a point $(c_0^*,\ldots,c_n^*)$ in $S$; letting $p_n^*(x) = c_0^* + \cdots + c_n^* x^n$, we
find that $d = \|f - p_n^*\|_\infty$ and therefore $p_n^*$ is the required polynomial of
best approximation of degree $n$ to the function $f$ in the $\infty$-norm.


Due to the nonconstructive nature of its proof, the last theorem does
not actually tell us how to find a polynomial of best approximation of
degree $n$ for a given function $f \in C[a,b]$. Therefore, our goal is now to
devise a constructive characterisation of the property '$p_n$ is a polynomial
of best approximation of degree $n$ to the function $f$ in the $\infty$-norm'.
Before doing so, however, let us simplify our terminology.

Writing the polynomial $q_n \in \mathcal{P}_n$ in the form

\[ q_n(x) = c_0 + \cdots + c_n x^n, \]

we want to choose the coefficients $c_j$, $j = 0,\ldots,n$, so that they minimise
the function $E\colon (c_0,\ldots,c_n) \mapsto E(c_0,\ldots,c_n)$ defined by

\[ \begin{aligned}
E(c_0,\ldots,c_n) &= \|f - q_n\|_\infty \\
&= \max_{x\in[a,b]} |f(x) - c_0 - \cdots - c_n x^n|
\end{aligned} \]

over $\mathbb{R}^{n+1}$. Since the polynomial of best approximation is to minimise
(over $q \in \mathcal{P}_n$) the maximum absolute value of the error $f(x) - q(x)$ (over
$x \in [a,b]$), it is often referred to as the minimax polynomial; from
now on, for the sake of brevity, we shall use the latter terminology.

Before we embark on the constructive characterisation of the minimax 
polynomial of a continuous function, let us consider a simple example 
which illustrates some of its key properties. 


Example 8.3 Suppose that $f \in C[0,1]$, and that $f$ is strictly monotonic
increasing on $[0,1]$. We wish to find the minimax polynomial $p_0$ of degree
zero for $f$ on $[0,1]$.

Fig. 8.1. Minimax approximation $p_0 \in \mathcal{P}_0$ of a strictly monotonic increasing
continuous function $f$ defined on the interval $[0,1]$.


The polynomial $p_0$ will be of the form $p_0(x) = c_0$, and we need to
determine $c_0 \in \mathbb{R}$ so that

\[ \|f - p_0\|_\infty = \max_{x\in[0,1]} |f(x) - c_0| \]

is minimal. Since $f$ is monotonic increasing, $f(x) - c_0$ attains its
minimum at $x = 0$ and its maximum at $x = 1$; therefore $|f(x) - c_0|$ reaches
its maximum value at one of the endpoints of $[0,1]$, i.e.,

\[ E(c_0) = \max_{x\in[0,1]} |f(x) - c_0| = \max\{|f(0) - c_0|, |f(1) - c_0|\}. \]

Clearly,

\[ E(c_0) = \begin{cases}
f(1) - c_0 & \text{if } c_0 < \tfrac{1}{2}(f(0) + f(1)), \\
c_0 - f(0) & \text{if } c_0 \ge \tfrac{1}{2}(f(0) + f(1)).
\end{cases} \]

Drawing the graph of the function $c_0 \in \mathbb{R} \mapsto E(c_0) \in \mathbb{R}$ shows that
the minimum is attained when $c_0 = \tfrac{1}{2}(f(0) + f(1))$. Consequently, the
desired minimax polynomial of degree 0 for the function $f$ is

\[ p_0(x) = \tfrac{1}{2}(f(0) + f(1)), \quad x \in [0,1]. \]

The function $f$ and its minimax approximation $p_0 \in \mathcal{P}_0$ are depicted in
Figure 8.1.

More generally, if $f \in C[a,b]$ (not necessarily monotonic), and $\xi$ and
$\eta$ denote two points in $[a,b]$ where $f$ attains its minimum and maximum
values, respectively, then the minimax polynomial of degree 0 to $f$ on
$[a,b]$ is

\[ p_0(x) = \tfrac{1}{2}(f(\xi) + f(\eta)), \quad x \in [a,b]. \]
◇
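The degree-zero case can be checked numerically. The following sketch is only an illustration: it locates the minimum and maximum of $f$ on a sampling grid (an approximation in general), and it uses $\sin x$ on $[0,\pi]$, the function from the earlier remark, as test data; the function names are hypothetical.

```python
import math


def minimax_constant(f, a, b, samples=10001):
    """Approximate the degree-0 minimax polynomial of f on [a, b].

    Returns (c0, error) with c0 = (min f + max f) / 2 and
    error = (max f - min f) / 2, both taken over a sampling grid.
    """
    values = [f(a + (b - a) * i / (samples - 1)) for i in range(samples)]
    lo, hi = min(values), max(values)
    return 0.5 * (lo + hi), 0.5 * (hi - lo)


def sampled_norm(f, c, a, b, samples=10001):
    """Largest |f(x) - c| over the same kind of grid."""
    return max(abs(f(a + (b - a) * i / (samples - 1)) - c)
               for i in range(samples))


c0, err = minimax_constant(math.sin, 0.0, math.pi)
print(c0, err)                                      # both close to 1/2
print(sampled_norm(math.sin, 0.4, 0.0, math.pi))    # a worse constant
```

Any other constant gives a larger maximum error: moving $c_0$ away from the midpoint of the range of $f$ reduces the error at one extremum but increases it at the other.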


This example shows that the minimax polynomial $p_0$ of degree zero
for $f \in C[a,b]$ has the property that the approximation error $f - p_0$
attains its extrema at two points, $x = \xi$ and $x = \eta$, with the error

\[ f(x) - p_0(x) = \tfrac{1}{2}(f(x) - f(\xi)) + \tfrac{1}{2}(f(x) - f(\eta)) \]

being negative at one point, $x = \xi$, and positive at the other, $x = \eta$.
We shall prove that a property of this kind holds in general; the precise
formulation of the general result is given in Theorem 8.4 which is, due
to the oscillating nature of the approximation error, usually referred to
as the Oscillation Theorem: it gives a complete characterisation of the
minimax polynomial and provides a method for its construction. We
begin with a preliminary result due to de la Vallée Poussin.¹


Theorem 8.3 (De la Vallée Poussin's Theorem) Let $f \in C[a,b]$
and $r \in \mathcal{P}_n$. Suppose that there exist $n+2$ points $x_0 < \cdots < x_{n+1}$ in
the interval $[a,b]$, such that $f(x_i) - r(x_i)$ and $f(x_{i+1}) - r(x_{i+1})$ have
opposite signs, for $i = 0,\ldots,n$. Then,

\[ \min_{q\in\mathcal{P}_n} \|f - q\|_\infty \ge \min_{i=0,1,\ldots,n+1} |f(x_i) - r(x_i)|. \tag{8.6} \]

Proof The condition on the signs of $f(x_i) - r(x_i)$ is usually expressed by
saying that $f - r$ has alternating signs at the points $x_i$, $i = 0,1,\ldots,n+1$.
Let us denote the right-hand side of (8.6) by $\mu$. Clearly, $\mu \ge 0$; when
$\mu = 0$ the statement of the theorem is trivially true, so we shall assume
that $\mu > 0$. Suppose that (8.6) is false; then, for a minimax polynomial
approximation $p_n \in \mathcal{P}_n$ to the function $f$ we have²

\[ \|f - p_n\|_\infty = \min_{q\in\mathcal{P}_n} \|f - q\|_\infty < \mu. \]


¹ Charles Jean Gustave Nicolas, Baron de la Vallée Poussin (14 August 1866,
Louvain, Belgium – 2 March 1962, Louvain, Belgium) made important contributions
to approximation theory and number theory, proving in 1896 that the number of
primes less than $n$ is, asymptotically as $n \to \infty$, $n/\ln n$.

² Recall from Theorem 8.2 that such $p_n$ exists.
Therefore,

\[ |p_n(x_i) - f(x_i)| < |r(x_i) - f(x_i)|, \quad i = 0,1,\ldots,n+1. \]

Now,

\[ r(x_i) - p_n(x_i) = [r(x_i) - f(x_i)] - [p_n(x_i) - f(x_i)], \quad i = 0,1,\ldots,n+1. \]

Since the first term on the right always exceeds the second term in
absolute value, it follows that $r(x_i) - p_n(x_i)$ and $r(x_i) - f(x_i)$ have the
same sign for $i = 0,1,\ldots,n+1$. Hence $r - p_n$, which is a polynomial of
degree $n$ or less, changes sign $n+1$ times and so has at least $n+1$ distinct
zeros, although it is not identically zero. Thus, the assumption that (8.6) is
false has led to a contradiction, and the proof is complete.
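Theorem 8.3 yields a computable lower bound on the minimax error, while $\|f - r\|_\infty$ is always an upper bound. The sketch below brackets the minimax error for one illustrative case; the choice $f(x) = \mathrm{e}^x$ on $[0,1]$, the trial polynomial $r(x) = 0.9 + 1.7x$ and the three points are assumptions made for this example only, and the upper bound is estimated on a sampling grid.

```python
import math


def vallee_poussin_bounds(f, r, points, a, b, samples=20001):
    """Return (lower, upper) bounds for min over q in P_n of ||f - q||_inf.

    `points` must be n + 2 increasing points of [a, b] at which f - r
    alternates in sign; `upper` is a grid estimate of ||f - r||_inf.
    """
    errs = [f(x) - r(x) for x in points]
    if not all(errs[i] * errs[i + 1] < 0 for i in range(len(errs) - 1)):
        raise ValueError("f - r does not alternate in sign at the points")
    lower = min(abs(e) for e in errs)
    grid = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    upper = max(abs(f(x) - r(x)) for x in grid)
    return lower, upper


def trial_line(x):
    # Hypothetical trial polynomial of degree 1, not a minimax polynomial.
    return 0.9 + 1.7 * x


lower, upper = vallee_poussin_bounds(math.exp, trial_line,
                                     [0.0, 0.5, 1.0], 0.0, 1.0)
print(lower, upper)
```

The closer the values $|f(x_i) - r(x_i)|$ are to one another and to $\|f - r\|_\infty$, the tighter this bracket becomes, which is the idea exploited in the Oscillation Theorem.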


Theorem 8.3 gives a clue to formulating a constructive characterisation
of the minimax polynomial: indeed, we shall show that if the quantities
$|f(x_i) - r(x_i)|$, $i = 0,1,\ldots,n+1$, in Theorem 8.3 are all equal to $\|f - r\|_\infty$,
then $r \in \mathcal{P}_n$ is, in fact, a minimax polynomial of degree $n$ for the function
$f$ on the interval $[a,b]$.


Theorem 8.4 (The Oscillation Theorem) Suppose that $f \in C[a,b]$.
A polynomial $r \in \mathcal{P}_n$ is a minimax polynomial for $f$ on $[a,b]$ if, and only
if, there exists a sequence of $n+2$ points $x_i$, $i = 0,1,\ldots,n+1$, such
that $a \le x_0 < \cdots < x_{n+1} \le b$,

\[ |f(x_i) - r(x_i)| = \|f - r\|_\infty, \quad i = 0,1,\ldots,n+1, \]

and

\[ f(x_i) - r(x_i) = -[f(x_{i+1}) - r(x_{i+1})], \quad i = 0,\ldots,n. \]

The statement of the theorem is often expressed by saying that $f - r$
attains its maximum absolute value with alternating signs at the points
$x_i$. The points $x_i$, $i = 0,1,\ldots,n+1$, in the Oscillation Theorem are
referred to as critical points.
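The two conditions of the Oscillation Theorem are straightforward to test for a given candidate. The sketch below does so for an illustrative case chosen for this example: $f(x) = x^2$ on $[-1,1]$ with $n = 1$ and the candidate $r(x) = \tfrac12$, for which $f - r$ takes the values $\tfrac12, -\tfrac12, \tfrac12$ at $x = -1, 0, 1$. The norm is estimated on a sampling grid, and the function names are hypothetical.

```python
def satisfies_oscillation(f, r, points, a, b, samples=20001, tol=1e-9):
    """Check the two conditions of the Oscillation Theorem at `points`.

    `points` should be n + 2 increasing points in [a, b]; the norm
    ||f - r||_inf is estimated as a maximum over a sampling grid.
    """
    grid = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    norm = max(abs(f(x) - r(x)) for x in grid)
    errs = [f(x) - r(x) for x in points]
    attains_norm = all(abs(abs(e) - norm) <= tol for e in errs)
    alternates = all(errs[i] * errs[i + 1] < 0 for i in range(len(errs) - 1))
    return attains_norm and alternates


def square(x):
    return x * x


critical_points = [-1.0, 0.0, 1.0]          # n + 2 = 3 points for n = 1
print(satisfies_oscillation(square, lambda x: 0.5, critical_points, -1.0, 1.0))
print(satisfies_oscillation(square, lambda x: 0.4, critical_points, -1.0, 1.0))
```

The first call reports that the conditions hold, so $r(x) = \tfrac12$ is a minimax polynomial of degree 1 for $x^2$ on $[-1,1]$; the second candidate fails because the error no longer has equal magnitude at the three points.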


Proof of theorem If $f \in \mathcal{P}_n$, then the result is trivially true, with $r = f$
and any sequence of $n+2$ distinct points $x_i$, $i = 0,1,\ldots,n+1$, contained
in $[a,b]$. Thus, we shall suppose throughout the proof that $f \notin \mathcal{P}_n$, i.e.,
$f$ is such that there is no polynomial $p \in \mathcal{P}_n$ whose restriction to $[a,b]$
is identically equal to $f$.

The sufficiency of the condition stated in the theorem is easily shown.
Suppose that the sequence of points $x_i$, $i = 0,1,\ldots,n+1$, exists with
the given properties. Define

\[ L = \|f - r\|_\infty \quad \text{and} \quad E_n(f) = \min_{q\in\mathcal{P}_n} \|f - q\|_\infty. \]

From De la Vallée Poussin's Theorem, Theorem 8.3, it follows that
$E_n(f) \ge L$. By the definition of $E_n(f)$ we also see that $E_n(f) \le
\|f - r\|_\infty = L$. Hence $E_n(f) = L$, and the given polynomial $r$ is a
minimax polynomial.

For the necessity of the condition, suppose that the given polynomial
$r \in \mathcal{P}_n$ is a minimax polynomial for $f$ on $[a,b]$. As $x \mapsto |f(x) - r(x)|$ is a
continuous function on the bounded closed interval $[a,b]$, there exists a
point in $[a,b]$ at which $|f(x) - r(x)|$ attains its maximum value, $L > 0$;
let

\[ x_0 = \min\{x \in [a,b] \colon |f(x) - r(x)| = L\}. \]

Now, $x_0 = b$ would imply that $|f(x) - r(x)| = L$ for all $x \in [a,b]$. As $f$ is
continuous on $[a,b]$, it would then follow that either $f(x) = r(x) + L$ for
all $x \in [a,b]$ or $f(x) = r(x) - L$ for all $x \in [a,b]$; either way, we would find
that $f \in \mathcal{P}_n$, which is assumed not to be the case. Therefore, $x_0 \in [a,b)$;
we may assume without loss of generality that $f(x_0) - r(x_0) = L > 0$.

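Now, we shall prove the existence of the next critical point, $x_1 \in (x_0, b]$
such that $f(x_1) - r(x_1) = -L$. Suppose otherwise, for contradiction;
then, $-L < f(x) - r(x) \le L$ for all $x$ in $[a,b]$. Thus, by the continuity
of $f$, there exists $\delta \in (0, L)$ such that $-L + \delta < f(x) - r(x) \le L$ for all
$x \in [a,b]$. Let us define $r^* \in \mathcal{P}_n$ by
\[ r^*(x) = r(x) + \varepsilon, \]
where $0 < \varepsilon < \min\{\delta, L\} = \delta$. Then, for all $x \in [a,b]$,
\[ f(x) - r^*(x) = f(x) - r(x) - \varepsilon > -L + \delta - \varepsilon > -L \]
and
\[ f(x) - r^*(x) = f(x) - r(x) - \varepsilon \le L - \varepsilon < L, \]
which means that
\[ \|f - r^*\|_\infty < L = \|f - r\|_\infty . \]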


Hence, $r^* \in \mathcal{P}_n$ is a better approximation to $f$ on $[a,b]$ than $r \in \mathcal{P}_n$ is.
This, however, contradicts our hypothesis that $r$ is a polynomial of best
approximation to $f$ on $[a,b]$ from $\mathcal{P}_n$, and implies the existence of
\[ x_1 = \inf\{x \in (x_0, b] : f(x) - r(x) = -L\}. \]
Consequently, $f(x_1) - r(x_1) = -L$ and $x_1 \in (x_0, b]$, as required; thus if
$n = 0$, the proof is complete.

Let us, therefore, suppose that $n \ge 1$, and successively define the
critical points
\[ x_i = \inf\{x \in (x_{i-1}, b] : f(x) - r(x) = (-1)^i L\}, \qquad i = 1, \ldots, m, \]
continuing either until $x_m = b$ or until we find an $x_m < b$ such that
$|f(x) - r(x)| < L$ for all $x \in (x_m, b]$. Now, either $m \ge n+1$, and then
the proof is complete as we will have found $n+2$ critical points,
$x_0 < x_1 < \cdots < x_{n+1}$ in $[a,b]$, with the required properties, or $1 \le m \le n$.

To complete the proof of the theorem, we shall show that the second
alternative, $1 \le m \le n$, leads to a contradiction, and is, therefore, not
possible. Let us suppose, for this purpose, that $1 \le m \le n$, and let
$\eta_0 = a$. Further, observe that, due to the definition of the points $x_i$,
$i = 0, 1, \ldots, m$,
\[ \exists\, \eta_i \in (x_{i-1}, x_i) \quad \forall x \in [\eta_i, x_i) \quad |f(x) - r(x)| < L, \qquad i = 1, \ldots, m, \]
and define $\eta_{m+1} = b$.
It follows from the choice of the $\eta_i$, $i = 0, 1, \ldots, m+1$, that the
following properties hold:
(a) $|f(x) - r(x)| \le L$ for all $x \in [\eta_i, \eta_{i+1}]$ and all $i = 0, 1, \ldots, m$;
(b) for each $i = 0, 1, \ldots, m$ there exists $x \in [\eta_i, \eta_{i+1}]$ (say, $x = x_i$),
such that $f(x) - r(x) = (-1)^i L$;
(c) there exist no $i \in \{0, 1, \ldots, m\}$ and $x \in [\eta_i, \eta_{i+1}]$ such that
$f(x) - r(x) = (-1)^{i+1} L$;
(d) $|f(\eta_i) - r(\eta_i)| < L$ for all $i = 1, \ldots, m$.
Now, let
\[ v(x) = \prod_{i=1}^{m} (\eta_i - x), \]
and define
\[ r^*(x) = r(x) + \varepsilon v(x), \]
where $\varepsilon > 0$ is a fixed real number, to be chosen below. Since, by
hypothesis, $1 \le m \le n$, it follows that $r^* \in \mathcal{P}_n$. Let us consider the
behaviour of the difference
\[ f(x) - r^*(x) = f(x) - r(x) - \varepsilon v(x) \]
on each of the intervals $[\eta_i, \eta_{i+1}]$, $i = 0, 1, \ldots, m$ (whose union is $[a,b]$).
We shall prove that, for $\varepsilon > 0$ sufficiently small,
\[ |f(x) - r^*(x)| < L = \|f - r\|_\infty \]
for all $x$ in $[\eta_i, \eta_{i+1}]$ and all $i = 0, 1, \ldots, m$; i.e., $\|f - r^*\|_\infty < \|f - r\|_\infty$,
contradicting the fact that $r \in \mathcal{P}_n$ is a minimax polynomial for $f$ on
$[a,b]$, and refuting the hypothesis that $1 \le m \le n$.

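Take, for example, the interval $[\eta_0, \eta_1]$. For each $x$ in $[\eta_0, \eta_1)$ we have
$v(x) > 0$ and therefore, by the definition of $r^*(x)$ and property (a) above,
\[ f(x) - r^*(x) \le L - \varepsilon v(x) < L, \qquad x \in [\eta_0, \eta_1). \]
Further, as $v(\eta_1) = 0$, it follows from (d) that
\[ f(\eta_1) - r^*(\eta_1) = f(\eta_1) - r(\eta_1) < L. \]
Therefore, $f(x) - r^*(x) < L$ for each $x$ in $[\eta_0, \eta_1]$. For a lower bound
on $f(x) - r^*(x)$, note that by (a) and (c), $f(x) - r(x) > -L$ for all
$x$ in $[\eta_0, \eta_1]$. As $f - r$ is a continuous function on $[\eta_0, \eta_1]$, there exists
$\delta_1 \in (0, L)$ such that $f(x) - r(x) \ge -L + \delta_1$ for all $x$ in $[\eta_0, \eta_1]$. Thus,
for $0 < \varepsilon < \min\{L, \delta_1, \varepsilon_1\}$, where
\[ \varepsilon_1 = \frac{\delta_1}{\max_{x \in [\eta_0, \eta_1]} |v(x)|}, \]
we have that
\[ f(x) - r^*(x) \ge -L + \delta_1 - \varepsilon |v(x)| > -L, \qquad x \in [\eta_0, \eta_1). \]
Further, by (d) above,
\[ f(\eta_1) - r^*(\eta_1) = f(\eta_1) - r(\eta_1) > -L. \]
Hence, $f(x) - r^*(x) > -L$ for all $x \in [\eta_0, \eta_1]$, for $0 < \varepsilon < \min\{L, \delta_1, \varepsilon_1\}$.
Combining the upper and lower bounds on $f(x) - r^*(x)$, we deduce that
\[ |f(x) - r^*(x)| < L = \|f - r\|_\infty, \qquad x \in [\eta_0, \eta_1]. \]

Arguing in the same manner on each of the other intervals $[\eta_i, \eta_{i+1}]$,
$i = 1, \ldots, m$, with $0 < \varepsilon < \min\{L, \delta_{i+1}, \varepsilon_{i+1}\}$, $i = 1, \ldots, m$, and $\delta_{i+1}$
and $\varepsilon_{i+1}$ defined analogously to $\delta_1$ and $\varepsilon_1$ above, we conclude that
\[ |f(x) - r^*(x)| < L = \|f - r\|_\infty, \qquad x \in [\eta_i, \eta_{i+1}], \quad i = 0, 1, \ldots, m, \]
and hence, for $0 < \varepsilon < \min\{L, \delta_1, \varepsilon_1, \ldots, \delta_{m+1}, \varepsilon_{m+1}\}$,
\[ \|f - r^*\|_\infty < L = \|f - r\|_\infty . \]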
Fig. 8.2. The Oscillation Theorem: the difference $f(x) - r(x)$, where $r$ is a
cubic approximation to a continuous function $f$, and the effect of replacing
$r(x)$ by $r^*(x) = r(x) - \varepsilon v(x)$, where $v(x) = (\eta_1 - x)(\eta_2 - x)$.


As $r^*$ is in $\mathcal{P}_n$, the last inequality contradicts our assumption that $r$ is
a polynomial of best approximation to $f$ on $[a,b]$ from $\mathcal{P}_n$. The contradiction
rules out the possibility that $1 \le m \le n$. Since $m \ge 1$, it follows
that $m \ge n+1$, and the proof is complete.


In the proof of the Oscillation Theorem we supposed, without loss of
generality, that $f(x_0) - r(x_0) = L > 0$, where $L = \|f - r\|_\infty$. When
$f(x_0) - r(x_0) = -L < 0$ the proof is analogous, except we then define
$r^*(x) = r(x) - \varepsilon$ to prove the existence of the critical point $x_1 \in (x_0, b]$
and, in the discussion of the case $1 \le m \le n$, we let
\[ r^*(x) = r(x) - \varepsilon v(x), \]
with $v(x)$ and $\varepsilon > 0$ defined as before.

A typical situation is illustrated in Figure 8.2, which represents the
difference $f - r$, where $r$ is a polynomial approximation of degree 3 to
a continuous function $f$. Here $|f - r|$ attains its maximum value with
alternate signs at the points $P_0$, $P_1$ and $P_2$, so that $m = 2 < n = 3$.
Let $x_0$, $x_1$ and $x_2$ denote the $x$-coordinates of $P_0$, $P_1$, $P_2$, respectively.
Clearly, $f(x_0) - r(x_0) = -L < 0$, where $L = \|f - r\|_\infty$. Also, the two
points $\eta_1$ and $\eta_2$ are as shown, $v(x) = (\eta_1 - x)(\eta_2 - x)$, and the effect
of replacing $r$ by $r^*(x) = r(x) - \varepsilon v(x)$, with $\varepsilon > 0$, is indicated by the
arrows. Since $f - r^* = f - r + \varepsilon v(x)$ and $v$ is negative for $x \in (\eta_1, \eta_2)$
and positive outside $[\eta_1, \eta_2]$, $|f - r^*|$ will be smaller than $|f - r|$ at
each of the points $P_i$, $i = 0, 1, 2$. There are two other local extrema
for the error function $f - r$: a minimum at $Q$ and a maximum at $R$.
Since both these points are to the right of $\eta_2$, where $v(x) > 0$, we shall
have $f - r^* > f - r$ at both of $Q$ and $R$, and $|f - r^*| > |f - r|$ at $R$.
The magnitude of the extra term $\varepsilon v(x)$ must therefore be limited by the
need to avoid the new difference $f - r^*$ becoming too large at $R$. We can
achieve this by selecting $\varepsilon > 0$ sufficiently small. In this illustration the
polynomial $r \in \mathcal{P}_3$ is not a minimax approximation to $f$ on the given
interval, since we can construct a better approximation $r^*$ which is also
in $\mathcal{P}_3$.

We can now apply the Oscillation Theorem to prove that the minimax 
polynomial is unique. 


Theorem 8.5 (Uniqueness Theorem) Suppose that $[a,b]$ is a bounded
closed interval of the real line. Each $f \in C[a,b]$ has a unique minimax
polynomial $p_n \in \mathcal{P}_n$ on $[a,b]$.


Proof Suppose that $q_n \in \mathcal{P}_n$ is also a minimax polynomial for $f$, and
that $p_n$ and $q_n$ are distinct. Then,
\[ \|f - p_n\|_\infty = \|f - q_n\|_\infty = E_n(f), \]
where, as in the proof of the Oscillation Theorem, we have used the
notation
\[ E_n(f) = \min_{q \in \mathcal{P}_n} \|f - q\|_\infty . \]
This implies, by the triangle inequality, that
\[ \left\| f - \tfrac{1}{2}(p_n + q_n) \right\|_\infty = \left\| \tfrac{1}{2}(f - p_n) + \tfrac{1}{2}(f - q_n) \right\|_\infty \le \tfrac{1}{2}\|f - p_n\|_\infty + \tfrac{1}{2}\|f - q_n\|_\infty = E_n(f). \]
Therefore $\tfrac{1}{2}(p_n + q_n) \in \mathcal{P}_n$ is also a minimax polynomial approximation
to $f$ on $[a,b]$. By the Oscillation Theorem there exists a sequence of
$n+2$ critical points $x_i$, $i = 0, 1, \ldots, n+1$, at which
\[ \left| f(x_i) - \tfrac{1}{2}\left( p_n(x_i) + q_n(x_i) \right) \right| = E_n(f), \qquad i = 0, 1, \ldots, n+1. \]
This is equivalent to
\[ \left| \left( f(x_i) - p_n(x_i) \right) + \left( f(x_i) - q_n(x_i) \right) \right| = 2E_n(f). \]
Now
\[ |f(x_i) - p_n(x_i)| \le \max_{x \in [a,b]} |f(x) - p_n(x)| = \|f - p_n\|_\infty = E_n(f), \]
and, for the same reason,
\[ |f(x_i) - q_n(x_i)| \le E_n(f). \]
It therefore follows¹ that
\[ f(x_i) - p_n(x_i) = f(x_i) - q_n(x_i), \qquad i = 0, 1, \ldots, n+1. \]
Thus, the difference $p_n - q_n$ vanishes at $n+2$ distinct points. As $p_n - q_n$
is a polynomial of degree $n$ or less, it follows that $p_n - q_n$ is identically
zero. This, however, contradicts our initial hypothesis that $p_n$ and $q_n$ are
distinct, and implies the uniqueness of the minimax polynomial $p_n \in \mathcal{P}_n$
for $f \in C[a,b]$.


As an application of the Oscillation Theorem, we consider the construction
of the minimax approximation $p_1 \in \mathcal{P}_1$ of degree 1 to a function
$f \in C[a,b]$ on the interval $[a,b]$, where we assume that $f$ has a continuous
and strictly monotonic increasing derivative $f'$ on this interval.

We seek the minimax polynomial $p_1 \in \mathcal{P}_1$ in the form $p_1(x) = c_1 x + c_0$.
The difference $f(x) - (c_1 x + c_0)$ attains its extrema either at the endpoints
of the interval $[a,b]$ or at points where its derivative $f'(x) - c_1$ is zero.
Since $f'$ is strictly monotonic increasing it can only take the value $c_1$ at
one point at most. Therefore the endpoints of the interval, $a$ and $b$, are
critical points. Let us denote by $d$ the third critical point whose location
inside $(a,b)$ remains to be determined. Since the critical point $x = d$ is
an internal extremum of $f(x) - (c_1 x + c_0)$, it follows that
\[ \left. \left( f(x) - (c_1 x + c_0) \right)' \right|_{x=d} = 0. \]


By the Oscillation Theorem, with $x_0 = a$, $x_1 = d$, $x_2 = b$, we have the
equations
\[ \begin{aligned} f(a) - (c_1 a + c_0) &= A, \\ f(d) - (c_1 d + c_0) &= -A, \\ f(b) - (c_1 b + c_0) &= A, \end{aligned} \tag{8.7} \]
where either $A = L$ or $A = -L$, with $L = \max_{x \in [a,b]} |f(x) - p_1(x)|$.
Along with the condition
\[ f'(d) = c_1 \tag{8.8} \]
this gives four equations to determine the unknowns $d$, $c_1$, $c_0$ and $A$.

¹ We use the following elementary result: if $P$ and $Q$ are two real numbers and $E$
is a nonnegative real number such that $|P + Q| = 2E$, $|P| \le E$ and $|Q| \le E$,
then $P = Q$. This follows by noting that $(P - Q)^2 = 2P^2 + 2Q^2 - (P + Q)^2 \le
2E^2 + 2E^2 - 4E^2 = 0$, and hence $P - Q = 0$. In the proof of the theorem we apply
this with $P = f(x_i) - p_n(x_i)$, $Q = f(x_i) - q_n(x_i)$ and $E = E_n(f)$.

Fig. 8.3. Construction of minimax polynomial of degree 1.

Subtracting the first equation in (8.7) from the third equation, we get
$f(b) - f(a) = c_1(b - a)$, whereby $c_1 = (f(b) - f(a))/(b - a)$. Now, by
the Mean Value Theorem, Theorem A.3, with this choice of $c_1$ equation
(8.8) has at least one solution, $d$, in the open interval $(a,b)$. In fact,
the value of $d$ is uniquely determined by (8.8), as $f'$ is continuous and
strictly monotonic increasing. Next, $c_0$ can be determined by adding
the second equation in (8.7) to the first. Having calculated both $c_1$ and
$c_0$ we insert them into the first equation in (8.7) to obtain $A$; finally
$L = |A|$.

The construction of the minimax polynomial $p_1$ is illustrated in Figure
8.3; $R$ is the point at which the tangent to the curve $y = f(x)$ is parallel
to the chord $PQ$; the graph of $p_1(x)$ is parallel to these two lines, and
lies half-way between them.
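
The construction just described can be tried numerically. The following is a minimal sketch in Python, under the assumption that $f'$ is available as a function and is continuous and strictly increasing on $[a,b]$; the function name `linear_minimax` and the use of bisection to solve $f'(d) = c_1$ are illustrative choices, not part of the text. The example function $f(x) = e^x$ on $[0,1]$ is likewise only an illustration.

```python
import math

def linear_minimax(f, fprime, a, b):
    """Minimax line c1*x + c0 for f on [a, b], assuming f' is continuous
    and strictly increasing on [a, b]. Returns (c1, c0, d, A)."""
    # Third minus first equation in (8.7): the slope is that of the chord.
    c1 = (f(b) - f(a)) / (b - a)
    # Solve f'(d) = c1 by bisection; f' is increasing, so the root is unique.
    lo, hi = a, b
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fprime(mid) < c1:
            lo = mid
        else:
            hi = mid
    d = 0.5 * (lo + hi)
    # Adding the first two equations in (8.7) eliminates A and gives c0.
    c0 = 0.5 * (f(a) + f(d) - c1 * (a + d))
    # The first equation in (8.7) then gives A, and L = |A|.
    A = f(a) - (c1 * a + c0)
    return c1, c0, d, A

# Illustrative example: f(x) = exp(x) on [0, 1], whose derivative is increasing.
c1, c0, d, A = linear_minimax(math.exp, math.exp, 0.0, 1.0)
print("c1 =", c1, " c0 =", c0, " d =", d, " L =", abs(A))
```

For this example the interior critical point satisfies $e^d = e - 1$, so the computed `d` can be compared with $\ln(e - 1)$, and the error $f - p_1$ can be sampled on a grid to confirm that it takes the values $A$, $-A$, $A$ at $a$, $d$, $b$.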
Table 8.1. The first seven Chebyshev polynomials: $T_0, T_1, \ldots, T_6$.

$T_0(x) = 1$
$T_1(x) = x$
$T_2(x) = 2x^2 - 1$
$T_3(x) = 4x^3 - 3x$
$T_4(x) = 8x^4 - 8x^2 + 1$
$T_5(x) = 16x^5 - 20x^3 + 5x$
$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1$

8.4 Chebyshev polynomials 


There are very few functions for which it is possible to write down in 
simple closed form the minimax polynomial. One such problem of prac- 
tical importance concerns the approximation of a power of x by a poly- 
nomial of lower degree. The minimax approximation in this case is given 
in terms of Chebyshev polynomials. 


Definition 8.2 The Chebyshev polynomial $T_n$ of degree $n$ is defined,
for $x \in [-1,1]$, by
\[ T_n(x) = \cos(n \cos^{-1} x), \qquad n = 0, 1, 2, \ldots . \]

Despite its unusual form, $T_n$ is a polynomial in disguise. For example,
$T_0(x) \equiv 1$, $T_1(x) = x$ for all $x \in [-1,1]$, and so on. In order to show
that this is true in general, we recall the trigonometric identity
\[ \cos (n+1)\vartheta + \cos (n-1)\vartheta = 2 \cos\vartheta \cos n\vartheta, \]
and set $\vartheta = \cos^{-1} x$, with $x \in [-1,1]$, to obtain the recurrence relation
\[ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), \qquad n = 1, 2, 3, \ldots, \quad x \in [-1,1]. \]

Since $T_0$ and $T_1$ have already been shown to be polynomials on $[-1,1]$,
we deduce from this recurrence relation, by induction, that $T_n$ is a polynomial
of degree $n$ on $[-1,1]$ for each $n \ge 0$. A list of the first seven
Chebyshev polynomials is given in Table 8.1.
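
The recurrence relation translates directly into a short computation of the coefficients of $T_n$. The sketch below is only an illustration in Python; the helper names `chebyshev_coeffs` and `evaluate` are hypothetical, and coefficients are stored in increasing powers of $x$.

```python
import math

def chebyshev_coeffs(n):
    """Coefficients of T_n in increasing powers of x, obtained from
    T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x) with T_0 = 1 and T_1 = x."""
    prev, curr = [1], [0, 1]
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + [2 * c for c in curr]             # 2x * T_k
        padded = prev + [0] * (len(shifted) - len(prev))  # T_{k-1}, padded
        prev, curr = curr, [s - p for s, p in zip(shifted, padded)]
    return curr

def evaluate(coeffs, x):
    """Horner evaluation of a polynomial given in increasing powers."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Print the coefficient lists of T_0, ..., T_6 for comparison with Table 8.1.
for n in range(7):
    print(n, chebyshev_coeffs(n))

# Compare the polynomial form with the defining formula cos(n arccos x).
x = 0.3
print(evaluate(chebyshev_coeffs(5), x), math.cos(5 * math.acos(x)))
```

Working with integer coefficient lists keeps the recurrence exact, so the output can be checked against the table entry by entry.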


1 Pafnuty Lvovich Chebyshev (16 May 1821, Okatovo, Russia – 8 December 1894,
St Petersburg, Russia). In 1850 Chebyshev proved the Bertrand conjecture, that
there is always at least one prime between $n$ and $2n$ for $n > 2$. He also came close
to proving the Prime Number Theorem which states that the number of primes
less than $n$ is, asymptotically as $n \to \infty$, $n/\ln n$. The proof was completed,
independently, by Hadamard and de la Vallée Poussin two years after Chebyshev's
death. Chebyshev made important contributions to probability theory, orthogonal
functions and the theory of integrals.
Fig. 8.4. The first three Chebyshev polynomials of (a) even degree, $T_0$, $T_2$,
$T_4$, and (b) odd degree $T_1$, $T_3$, $T_5$, plotted on the interval $[-1,1]$.


The polynomials $T_0$, $T_2$, $T_4$, and $T_1$, $T_3$, $T_5$, are depicted in Figure 8.4.
We see that the even-degree Chebyshev polynomials are even functions
(i.e., $T_{2k}(-x) = T_{2k}(x)$ for all $x \in [-1,1]$) and the odd-degree ones are
odd functions (i.e., $T_{2k+1}(-x) = -T_{2k+1}(x)$ for all $x \in [-1,1]$). They
all map the interval $[-1,1]$ into itself.¹

The proof of the next lemma is straightforward and is left as an exer- 
cise (see Exercise 10). 


Lemma 8.2 The Chebyshev polynomials have the following properties:

(i) $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$, $x \in [-1,1]$, $n = 1, 2, 3, \ldots$;
(ii) for $n \ge 1$, $T_n$ is a polynomial in $x$ of degree $n$ on the interval
$[-1,1]$, with leading term $2^{n-1}x^n$;
(iii) $T_n$ is an even function on $[-1,1]$ if $n$ is even, and an odd function
on $[-1,1]$ if $n$ is odd, $n \ge 0$;
(iv) for $n \ge 1$, the zeros of $T_n$ are at
\[ x_j = \cos\frac{(2j-1)\pi}{2n}, \qquad j = 1, \ldots, n; \]
they are all real and distinct, and lie in $(-1,1)$;
(v) $|T_n(x)| \le 1$ for all $x \in [-1,1]$ and all $n \ge 0$;
(vi) for $n \ge 1$, $T_n(x) = \pm 1$, alternately at the $n+1$ points
$x_k = \cos(k\pi/n)$, $k = 0, 1, \ldots, n$.

¹ In Maple, typing plot(orthopoly[T](7,x), x=-1..1, y=-1..1); will, for
example, plot the graph of the Chebyshev polynomial $T_7$ of degree 7 in $x$; $T_8$, $T_9$,
etc., can be obtained similarly. Incidentally, you may be wondering why $T_n$, and
not $C_n$, is used to denote the Chebyshev polynomial of degree $n$. The reasons are
largely historical: in some older books and articles Chebyshev's Russian surname
has been transliterated from the Cyrillic original as Tchebyshev, following the
French and German transliterations Tchebychef and Tschebyscheff, respectively.
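
Parts (iv), (v) and (vi) of the lemma can be checked numerically for a particular degree; this is a sanity check, not a proof. The sketch below is in Python, with the hypothetical helper `cheb` evaluating $T_n$ by the recurrence of part (i), and with $n = 7$ chosen arbitrarily.

```python
import math

def cheb(n, x):
    """Evaluate T_n(x) using the three-term recurrence of part (i)."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

n = 7
# Part (iv): the claimed zeros x_j = cos((2j - 1) pi / (2n)), j = 1, ..., n.
zeros = [math.cos((2 * j - 1) * math.pi / (2 * n)) for j in range(1, n + 1)]
residuals = [abs(cheb(n, z)) for z in zeros]

# Part (vi): the claimed points x_k = cos(k pi / n), k = 0, ..., n.
extrema = [math.cos(k * math.pi / n) for k in range(n + 1)]
extreme_values = [cheb(n, x) for x in extrema]

# Part (v): sample |T_n| on a uniform grid in [-1, 1].
grid_max = max(abs(cheb(n, -1.0 + 2.0 * i / 2000)) for i in range(2001))

print("largest |T_n| at the claimed zeros:", max(residuals))
print("values at the claimed extrema:", extreme_values)
print("max |T_n| on the grid:", grid_max)
```

The values at the points $x_k$ should alternate between $+1$ and $-1$, starting with $+1$ at $x_0 = 1$, up to rounding error.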


We can now apply the Oscillation Theorem to construct the minimax
polynomial of degree $n$ for $f \colon x \mapsto x^{n+1}$ on the interval $[-1,1]$.


Theorem 8.6 Suppose that $n \ge 0$. The polynomial $p_n \in \mathcal{P}_n$ defined by
\[ p_n(x) = x^{n+1} - 2^{-n} T_{n+1}(x), \qquad x \in [-1,1], \]
is the minimax approximation of degree $n$ to the function $x \mapsto x^{n+1}$ on
the interval $[-1,1]$.


Proof By part (ii) of Lemma 8.2, $p_n \in \mathcal{P}_n$. Since
\[ x^{n+1} - p_n(x) = 2^{-n} T_{n+1}(x), \]
by parts (v) and (vi) of Lemma 8.2, the difference $x^{n+1} - p_n(x)$ does not
exceed $2^{-n}$ in absolute value in the interval $[-1,1]$, and attains this value
with alternating signs at the $n+2$ points $x_k = \cos(k\pi/(n+1))$, $k = 0, 1, \ldots, n+1$.
Therefore, by the Oscillation Theorem, $p_n$ is the (unique) minimax polynomial
approximation from $\mathcal{P}_n$ to the function $x \mapsto x^{n+1}$ over $[-1,1]$.
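
As an illustration of Theorem 8.6, the following Python sketch builds the coefficients of $p_n$ from those of $T_{n+1}$ and evaluates the error at the critical points. The helper names (`chebyshev_coeffs`, `horner`, `minimax_of_power`) are hypothetical, and the choice $n = 2$ is arbitrary; in that case the construction gives $p_2(x) = \tfrac{3}{4}x$ as the minimax approximation to $x^3$.

```python
import math

def chebyshev_coeffs(n):
    """Coefficients of T_n in increasing powers of x, via the recurrence."""
    prev, curr = [1], [0, 1]
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + [2 * c for c in curr]
        padded = prev + [0] * (len(shifted) - len(prev))
        prev, curr = curr, [s - p for s, p in zip(shifted, padded)]
    return curr

def horner(coeffs, x):
    """Evaluate a polynomial given in increasing powers of x."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def minimax_of_power(n):
    """Coefficients (increasing powers) of p_n = x^{n+1} - 2^{-n} T_{n+1}."""
    t = chebyshev_coeffs(n + 1)
    p = [-c / 2 ** n for c in t]
    p[n + 1] += 1.0      # add x^{n+1}; the leading terms cancel exactly
    return p[: n + 1]    # what remains has degree at most n

n = 2
p = minimax_of_power(n)
points = [math.cos(k * math.pi / (n + 1)) for k in range(n + 2)]
errors = [x ** (n + 1) - horner(p, x) for x in points]
print("p_n coefficients:", p)
print("errors at the critical points:", errors)
```

The printed errors should have common magnitude $2^{-n}$ and alternate in sign, which is the pattern required by the Oscillation Theorem.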


A polynomial of degree $n$ whose leading coefficient, the coefficient
of $x^n$, is equal to 1, is called a monic polynomial of degree $n$. For
example, the polynomial $r \in \mathcal{P}_{n+1}$ defined by $r(x) = x^{n+1} - q(x)$ with
$q \in \mathcal{P}_n$, is a monic polynomial of degree $n+1$.


Corollary 8.1 Suppose that $n \ge 0$. Among all monic polynomials of
degree $n+1$ the polynomials $2^{-n}T_{n+1}$ and $-2^{-n}T_{n+1}$ have the smallest
$\infty$-norm on the interval $[-1,1]$.


Proof Let $\mathcal{P}^1_{n+1}$ denote the set of all monic polynomials of degree $n+1$.
Any $r \in \mathcal{P}^1_{n+1}$ can be regarded as the difference between the function
$x \mapsto x^{n+1}$ and a polynomial of lower degree, i.e., $r(x) = x^{n+1} - q(x)$
with $q \in \mathcal{P}_n$. Hence, by Theorem 8.6,
\[ \min_{r \in \mathcal{P}^1_{n+1}} \|r\|_\infty = \min_{q \in \mathcal{P}_n} \|x^{n+1} - q\|_\infty = \|x^{n+1} - (x^{n+1} - 2^{-n}T_{n+1})\|_\infty = \|2^{-n}T_{n+1}\|_\infty ; \]
the minimum is, therefore, achieved when $r \in \mathcal{P}^1_{n+1}$ is one of the monic
polynomials $2^{-n}T_{n+1}$ or $-2^{-n}T_{n+1}$.
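
To see the corollary at work, one can compare the $\infty$-norm of $2^{-n}T_{n+1}$, written as a product over its zeros, with that of another monic polynomial of the same degree. The Python sketch below uses the monic polynomial with equally spaced zeros in $[-1,1]$ as the comparison; this choice, the degree $n + 1 = 11$, the helper name `monic_max` and the grid size are all assumptions made for illustration, and the maximum is only approximated on a uniform grid.

```python
import math

def monic_max(roots, samples=4001):
    """Approximate the max over [-1, 1] of |prod_j (x - roots[j])|
    by sampling on a uniform grid that includes both endpoints."""
    best = 0.0
    for i in range(samples):
        x = -1.0 + 2.0 * i / (samples - 1)
        value = 1.0
        for r in roots:
            value *= (x - r)
        best = max(best, abs(value))
    return best

n = 10  # both monic polynomials below have degree n + 1
# Zeros of T_{n+1}: the product over them is 2^{-n} T_{n+1}.
cheb_roots = [math.cos((j + 0.5) * math.pi / (n + 1)) for j in range(n + 1)]
# Equally spaced zeros, for comparison.
equi_roots = [-1.0 + 2.0 * j / n for j in range(n + 1)]

cheb_norm = monic_max(cheb_roots)
equi_norm = monic_max(equi_roots)
print("Chebyshev zeros:      ", cheb_norm, " (2^-n =", 2.0 ** (-n), ")")
print("equally spaced zeros: ", equi_norm)
```

The first value should agree with $2^{-n}$, since $|T_{n+1}(\pm 1)| = 1$ and the grid contains the endpoints; the second is noticeably larger.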


8.5 Interpolation 


We close the body of this chapter with another application of Chebyshev
polynomials: it concerns the 'optimal' choice of interpolation points in
Lagrange interpolation. In Chapter 6 the error between an $n+1$ times
continuously differentiable function $f$, defined on a closed interval $[a,b]$
of the real line, and its Lagrange interpolation polynomial $p_n$ of degree
$n$, $n \ge 0$, with interpolation points $\xi_0, \ldots, \xi_n$, was shown to have the
form
\[ f(x) - p_n(x) = \frac{f^{(n+1)}(\eta)}{(n+1)!}\, \pi_{n+1}(x), \tag{8.9} \]
where $\eta = \eta(x) \in (a,b)$ and
\[ \pi_{n+1}(x) = (x - \xi_0) \cdots (x - \xi_n). \tag{8.10} \]
Clearly, $\pi_{n+1}$ is a monic polynomial of degree $n+1$.

In a practical application the values $\xi_i$ and $f(\xi_i)$, $i = 0, 1, \ldots, n$, may
be already given. However, in a situation where $[a,b] = [-1,1]$ and the
$\xi_i$, $i = 0, 1, \ldots, n$, can be freely chosen in the interval $[-1,1]$, Corollary
8.1 suggests that they should be taken as the zeros of the Chebyshev
polynomial $T_{n+1}$, for then $\pi_{n+1}$ will have the smallest $\infty$-norm on the
interval $[-1,1]$ among all monic polynomials. This observation motivates
the following result.


Theorem 8.7 Suppose that $f$ is a real-valued function, defined and
continuous on the closed real interval $[a,b]$, and such that the derivative
of $f$ of order $n+1$ is continuous on $[a,b]$. Let $p_n \in \mathcal{P}_n$ denote the
Lagrange interpolation polynomial of $f$, with interpolation points
\[ \xi_j = \tfrac{1}{2}(b - a) \cos\frac{(j + \tfrac{1}{2})\pi}{n+1} + \tfrac{1}{2}(b + a), \qquad j = 0, 1, \ldots, n; \]
then
\[ \|f - p_n\|_\infty \le \frac{(b - a)^{n+1}}{2^{2n+1}(n+1)!}\, M_{n+1}, \]
where $M_{n+1} = \max_{\zeta \in [a,b]} |f^{(n+1)}(\zeta)|$.

Proof Let $\tau_j = \cos\left((j + \tfrac{1}{2})\pi/(n+1)\right)$, $j = 0, 1, \ldots, n$, denote the zeros
of the polynomial $T_{n+1}(t)$ (in the interval $(-1,1)$). Hence,
\[ \prod_{j=0}^{n} (t - \tau_j) = 2^{-n} T_{n+1}(t), \qquad t \in [-1,1]. \]
Let us define the points $\xi_j$, $j = 0, 1, \ldots, n$, as in the statement of the
theorem. Clearly $\xi_j \in (a,b)$ is the image of $\tau_j \in (-1,1)$ under the linear
transformation $t \mapsto x = \tfrac{1}{2}(b - a)t + \tfrac{1}{2}(b + a)$; we note in passing that
the inverse of this mapping is $x \mapsto t(x) = (2x - a - b)/(b - a)$; thus,
\[ \pi_{n+1}(x) = \prod_{j=0}^{n} (x - \xi_j) = \left(\frac{b - a}{2}\right)^{n+1} \prod_{j=0}^{n} (t(x) - \tau_j) = \left(\frac{b - a}{2}\right)^{n+1} 2^{-n}\, T_{n+1}(t(x)). \]
The required bound now follows from (8.9), since $|T_{n+1}(t(x))| \le 1$ for
all $x \in [a,b]$, and therefore $|\pi_{n+1}(x)| \le (b - a)^{n+1} 2^{-2n-1}$.
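
The bound of Theorem 8.7 can be compared with the observed interpolation error in a concrete case. The Python sketch below takes $f(x) = e^{2x}$ on $[0,1]$ with $n = 4$ purely as an illustration; the helper names `chebyshev_points` and `lagrange_eval` are hypothetical, the interpolant is evaluated directly from the Lagrange form, and the $\infty$-norm of the error is only approximated on a uniform grid.

```python
import math

def chebyshev_points(a, b, n):
    """The n + 1 zeros of T_{n+1}, mapped linearly from (-1, 1) to (a, b)."""
    return [0.5 * (b - a) * math.cos((j + 0.5) * math.pi / (n + 1)) + 0.5 * (b + a)
            for j in range(n + 1)]

def lagrange_eval(nodes, values, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for j, (xj, fj) in enumerate(zip(nodes, values)):
        term = fj
        for i, xi in enumerate(nodes):
            if i != j:
                term *= (x - xi) / (xj - xi)
        total += term
    return total

a, b, n = 0.0, 1.0, 4

def f(x):
    return math.exp(2.0 * x)

nodes = chebyshev_points(a, b, n)
values = [f(x) for x in nodes]

# Approximate ||f - p_n||_inf on a uniform grid.
grid = [a + (b - a) * i / 2000 for i in range(2001)]
max_error = max(abs(f(x) - lagrange_eval(nodes, values, x)) for x in grid)

# For f(x) = exp(2x), f^{(n+1)}(x) = 2^{n+1} exp(2x), largest at x = b.
M = 2.0 ** (n + 1) * math.exp(2.0 * b)
bound = (b - a) ** (n + 1) / (2 ** (2 * n + 1) * math.factorial(n + 1)) * M

print("observed max error:", max_error)
print("bound of Theorem 8.7:", bound)
```

The observed error should not exceed the bound; the bound is pessimistic here because it replaces $f^{(n+1)}(\eta)$ by its maximum over the whole interval.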


The De la Vallée Poussin Theorem, Theorem 8.3, suggests the notion
of a near-minimax polynomial, which is a polynomial $p_n \in \mathcal{P}_n$
such that the difference $f(x) - p_n(x)$ changes sign at $n+1$ points
$\xi_j$, $j = 0, 1, \ldots, n$, with $a < \xi_0 < \cdots < \xi_n < b$; for the difference
$f(x) - p_n(x)$ then attains a local maximum or minimum with alternating
signs in each of the intervals $[a, \xi_0), (\xi_0, \xi_1), \ldots, (\xi_n, b]$. The positions
of these alternating local maxima and minima are then the points
$x_i$, $i = 0, 1, \ldots, n+1$, required by Theorem 8.3, and we therefore know
that the $\infty$-norm of the error of the minimax polynomial lies between
the least and greatest of the absolute values of these local maxima and
minima. In particular, we should expect that if the sizes of these local
maxima and minima are not greatly different, then the error of the near-minimax
approximation should not be very much larger than the error
of the minimax approximation.

Given any set of points $\xi_i$, $i = 0, 1, \ldots, n$, with $a < \xi_0 < \cdots < \xi_n < b$,
the polynomial $\pi_{n+1}(x) = (x - \xi_0) \cdots (x - \xi_n)$ changes sign at the $n+1$
points $\xi_i$, $i = 0, 1, \ldots, n$. Let us assume that $f \in C[a,b]$, $f^{(n+1)}$ exists
and is continuous on $[a,b]$, and $f^{(n+1)}$ has the same sign on the whole
of $(a,b)$. It then follows that the product $f^{(n+1)}(\eta)\pi_{n+1}(x)$ has exactly
$n+1$ sign-changes in the open interval $(a,b)$ for any $\eta \in (a,b)$. Thus,
according to (8.9), the Lagrange interpolation polynomial $p_n$ of degree $n$
for the function $f$, with interpolation points $\xi_i$, $i = 0, 1, \ldots, n$, contained
in the open interval $(a,b)$, is a near-minimax polynomial from $\mathcal{P}_n$ for $f$
on $[a,b]$.

We have therefore just shown that if $f^{(n+1)}$ exists and is continuous
on the closed interval $[a,b]$, and has the same sign on the open interval
$(a,b)$, then the polynomial constructed by interpolating at the points
$\xi_j$, $j = 0, 1, \ldots, n$, obtained by linearly mapping the $n+1$ zeros of the
Chebyshev polynomial $T_{n+1}(t)$ from $(-1,1)$ to $(a,b)$, is a near-minimax
approximation from $\mathcal{P}_n$ for the function $f \in C[a,b]$ on the interval $[a,b]$.
Notice that if we use equally spaced interpolation points, so that
$\xi_j = a + j(b - a)/n$, $j = 0, 1, \ldots, n$, $n \ge 1$, we shall not obtain a near-minimax
approximation, since the interpolation error now changes sign at only
$n - 1$ points, the interpolation points which are internal to $(a,b)$.


Fig. 8.5. Comparison of two polynomial approximations to $e^{2x}$ on $[0,1]$: the
thick curve is the error of the minimax approximation; the thin curve is the
error of the polynomial obtained by interpolation at the Chebyshev points.


As an illustration, Figure 8.5 shows the errors of two approximations
of degree 4 to the function $f(x) = e^{2x}$ over the interval $[0,1]$. One
of these is the minimax approximation, and the other is obtained by
interpolation at the zeros of $T_5(t)$. It is clear that they are quite close;
in fact the $\infty$-norms of the errors are 0.0015 and 0.0017 respectively.

In the next chapter we shall show that the least squares polynomial 
approximation to a continuous real-valued function is also near-minimax 
in this sense. 

An alternative and very easy way of constructing polynomial approximations
to many simple, smooth, functions is to truncate their Taylor
series expansion. For example,
\[ e^{2x} = 1 + 2x + \cdots + \frac{(2x)^n}{n!} + \cdots, \]
so we obtain a polynomial approximation $p_n(x)$ by taking the terms of
this series up to the one involving $x^n$. Then, clearly,
\[ e^{2x} - p_n(x) = \sum_{r=n+1}^{\infty} \frac{(2x)^r}{r!} . \]
Over the interval $[0,1]$, for example, this difference is nonnegative and
monotonic increasing; it does not change sign at all. Hence the polynomial
$p_n \in \mathcal{P}_n$ thus constructed is quite certainly not a near-minimax
approximation for $x \mapsto e^{2x}$ on $[0,1]$. Nevertheless, $\max_{x \in [0,1]} |e^{2x} - p_n(x)|$
can be made arbitrarily small by choosing $n$ sufficiently large.
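
The behaviour of the truncated Taylor series is easy to observe numerically. The Python sketch below assumes the function is $e^{2x}$ on $[0,1]$, as in Figure 8.5, and takes $n = 4$ so that the result can be set beside the errors quoted for that figure; the helper name `taylor_exp2x` is hypothetical.

```python
import math

def taylor_exp2x(n, x):
    """Partial sum of the Taylor series of exp(2x) about 0, up to x^n."""
    term, total = 1.0, 1.0
    for r in range(1, n + 1):
        term *= 2.0 * x / r      # term is now (2x)^r / r!
        total += term
    return total

n = 4
grid = [i / 1000 for i in range(1001)]
errors = [math.exp(2.0 * x) - taylor_exp2x(n, x) for x in grid]

print("error at x = 0:", errors[0])
print("error at x = 1:", errors[-1])
print("max error on the grid:", max(errors))

# Increasing n drives the maximum error towards zero.
for m in (4, 8, 12, 16):
    print(m, math.exp(2.0) - taylor_exp2x(m, 1.0))
```

For $n = 4$ the partial sum at $x = 1$ is $1 + 2 + 2 + \tfrac{4}{3} + \tfrac{2}{3} = 7$, so the largest error is $e^2 - 7$, which is far larger than the errors of the degree 4 approximations shown in Figure 8.5.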


8.6 Notes 


For further details on the topics presented in this chapter, we refer to 


» M.J.D. POWELL, Approximation Theory and Methods, Cambridge
University Press, Cambridge, 1996.


The Weierstrass Theorem is discussed in Chapter 6 of that book, and 
is stated in its Theorem 6.3. Although the proof presented by Powell 
uses the Bernstein polynomials, it is different from the more elementary 
but slightly lengthier argument proposed in Exercise 12 here: it relies 
on a proof of Bohman and Korovkin based on properties of monotone 
operators; see, also, p. 66 in Chapter 3 of 


» E.W. CHENEY, Introduction to Approximation Theory, McGraw-Hill, 
New York, 1966. 


The notes contained on pp. 224—233 of Cheney’s book are particularly 
illuminating. 

The proof of the Weierstrass Theorem as proposed in Exercise 12,
including the definition of what we today call Bernstein polynomials,
stems from a paper of Sergei Natanovich Bernstein (1880-1968), entitled
'Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités',
Comm. Soc. Math. Kharkow 18, 1-2, 1912/13.

Weierstrass' main contributions to approximation theory, as well
as those of other mathematicians (including Picard, Volterra, Runge,
Lebesgue, Mittag-Leffler, Fejér, Landau, de la Vallée Poussin, Bernstein),
are reviewed in the extensive historical survey by Allan Pinkus,
Weierstrass and approximation theory, J. Approx. Theory 107, 1-66,
2000. Further details about the history of the subject can be found at
the history of approximation theory website maintained by Allan Pinkus
and Carl de Boor: http://www.cs.wisc.edu/~deboor/HAT/

The second part of Theorem 8.1 concerning the approximability of
a continuous function by polynomials in the 2-norm is not usually presented
as part of the classical Weierstrass Theorem which is posed in the
$\infty$-norm. Here, we have chosen to state these results together in order
to highlight the analogy, as well as to motivate the use of the 2-norm in
polynomial approximation in the next chapter, Chapter 9.

In both Cheney's and Powell's books minimax approximation is treated
in the more general framework of Haar systems. An $(n+1)$-dimensional
linear subspace $\mathcal{A}$ of $C[a,b]$ is said to satisfy the Haar condition if, for
every nonzero $p$ in $\mathcal{A}$, the number of roots of the equation $p(x) = 0$
in the interval $[a,b]$ is less than $n+1$. The concept of Haar system is
due to Alfred Haar (1885-1933), Die Minkowskische Geometrie und die
Annäherung an stetige Funktionen, Math. Ann. 78, 294-311, 1918; this
paper contains Haar's Theorem which characterises finite-dimensional
Haar systems in spaces of continuous functions. The Characterisation
Theorem, formulated as Theorem 7.2 in Powell's book, shows that the
Oscillation Theorem, Theorem 8.4 of the present chapter, remains valid
in a more general setting when the set of polynomials $\{1, x, \ldots, x^n\}$ is
replaced by an $(n+1)$-dimensional Haar system of functions contained
in $C[a,b]$.


Exercises 


8.1 Give a proof of Lemma 8.1. 

8.2 Suppose that the real-valued function $f$ is continuous and even
on the interval $[-a,a]$, that is, $f(x) = f(-x)$ for all $x \in [-a,a]$.
By using the Uniqueness Theorem, or otherwise, show that the
minimax polynomial approximation of degree $n$ is an even function.
Deduce that the minimax polynomial approximation of
degree $2n$ is also the minimax polynomial approximation of degree
$2n+1$. What does this imply about the sequence of critical
points for the minimax polynomial $p_{2n}$?


8.3 State and prove similar results to those in Exercise 2, for the
case where $f$ is an odd function, that is, $f(x) = -f(-x)$ for all
$x \in [-a,a]$.

8.4 (i) Construct the minimax polynomial $p_2 \in \mathcal{P}_2$ on the interval
$[-1,1]$ for the function $g$ defined by $g(x) = \sin x$.
(ii) Construct the minimax polynomial $p_3 \in \mathcal{P}_3$ on the interval
$[-1,1]$ for the function $h$ defined by $h(x) = \cos x^2$.
(Use the results of Exercises 2 and 3.)

8.5 The function $H$ is defined by $H(x) = 1$ if $x > 0$, $H(x) = -1$
if $x < 0$, and $H(0) = 0$. Show that for any $n \ge 0$ and any
$p_n \in \mathcal{P}_n$, $\|H - p_n\|_\infty \ge 1$ on the interval $[-1,1]$. Construct
the polynomial, of degree 0, of best approximation to $H$ on the
interval $[-1,1]$, and show that it is unique. (Note that since $H$
is discontinuous most of the theorems in this chapter are not
applicable.)
Show that the polynomial of best approximation, of degree
1, to $H$ on $[-1,1]$ is not unique, and give an expression for its
most general form.

8.6 Suppose that $t_1 < t_2 < \cdots < t_k$ are $k$ distinct points in the
interval $[a,b]$; for any function $f$ defined on $[a,b]$, write
$Z_k(f) = \max_{i=1}^{k} |f(t_i)|$. Explain why $Z_k(\cdot)$ is not a norm on the space of
functions which are continuous on $[a,b]$; show that it is a norm
on the space of polynomials of degree $n$, provided that $k > n$.
In the case $k = 3$, with $t_1 = 0$, $t_2 = \tfrac{1}{2}$, $t_3 = 1$, where we
wish to approximate the function $f \colon x \mapsto e^x$ on the interval
$[0,1]$, explain graphically, or otherwise, why the polynomial $p_1$
of degree 1 which minimises $Z_3(f - p_1)$ satisfies the conditions
\[ f(0) - p_1(0) = -\left[ f(\tfrac{1}{2}) - p_1(\tfrac{1}{2}) \right] = f(1) - p_1(1). \]
Hence construct this polynomial $p_1$. Now suppose that $k = 4$,
with $t_1 = 0$, $t_2 = \tfrac{1}{3}$, $t_3 = \tfrac{2}{3}$, $t_4 = 1$; use a similar method to
construct the polynomial of degree 1 which minimises $Z_4(f - p_1)$.

8.7 Among all polynomials $p_n \in \mathcal{P}_n$ of the form
\[ p_n(x) = Ax^n + \sum_{k=0}^{n-1} a_k x^k, \]
where $A$ is a fixed nonzero real number, find the polynomial
of best approximation for the function $f(x) \equiv 0$ on the closed
interval $[-1,1]$.

8.8 Find the minimax polynomial $p_n \in \mathcal{P}_n$ on the interval $[-1,1]$
for the function $f$ defined by
\[ f(x) = \sum_{k=0}^{n+1} a_k x^k, \]
where $a_{n+1} \ne 0$.

8.9 Construct the minimax polynomial $p_1 \in \mathcal{P}_1$ on the interval
$[-1,2]$ for the function $f$ defined by $f(x) = |x|$.

8.10 Give a proof of Lemma 8.2.

8.11 Give an example of a continuous real-valued function $f$ defined
on the closed interval $[a,b]$ such that the set of critical points for
the minimax approximation of $f$ by polynomials from $\mathcal{P}_1$ does
not contain either of the points $a$ and $b$.

8.12 For each nonnegative integer $n$, and $x \in [0,1]$, define the Bernstein
polynomials $p_{nk} \in \mathcal{P}_n$ by
\[ p_{nk}(x) = \frac{n!}{k!\,(n-k)!}\, x^k (1 - x)^{n-k}, \qquad k = 0, 1, \ldots, n. \]
Show that
\[ (1 - x + tx)^n = \sum_{k=0}^{n} p_{nk}(x)\, t^k ; \]
by differentiating this relation successively with respect to $t$ and
putting $t = 1$, show that, for any $x \in [0,1]$,
\[ \sum_{k=0}^{n} p_{nk}(x) = 1, \qquad \sum_{k=0}^{n} k\, p_{nk}(x) = nx, \qquad \sum_{k=0}^{n} k(k-1)\, p_{nk}(x) = n(n-1)x^2, \]
and deduce that
\[ \sum_{k=0}^{n} (x - k/n)^2 p_{nk}(x) = \frac{x(1 - x)}{n}, \qquad x \in [0,1]. \]
Define $M$ to be the upper bound of $|f(x)|$ on $[0,1]$. Given
$\varepsilon > 0$, we can choose $\delta > 0$ such that $|f(x) - f(y)| < \varepsilon/2$ for
any $x$ and $y$ in $[0,1]$ such that $|x - y| < \delta$. Now define the
polynomial $p_n \in \mathcal{P}_n$ by
\[ p_n(x) = \sum_{k=0}^{n} f(k/n)\, p_{nk}(x), \]
and choose a fixed value of $x$ in $[0,1]$; show that
\[ |f(x) - p_n(x)| \le \sum_{k=0}^{n} |f(x) - f(k/n)|\, p_{nk}(x). \]
Using the notation
\[ \sum_{k=0}^{n} = \sum\nolimits_1 + \sum\nolimits_2, \]
where $\sum_1$ denotes the sum over those values of $k$ for which
$|x - k/n| < \delta$, and $\sum_2$ denotes the sum over those values of $k$
for which $|x - k/n| \ge \delta$, show that
\[ \sum\nolimits_1 |f(x) - f(k/n)|\, p_{nk}(x) < \varepsilon/2. \]
Show also that
\[ \sum\nolimits_2 |f(x) - f(k/n)|\, p_{nk}(x) \le (2M/\delta^2) \sum_{k=0}^{n} (x - k/n)^2 p_{nk}(x). \]
Now, choose $N_0 = M/(\delta^2 \varepsilon)$, and show that
\[ |f(x) - p_n(x)| < \varepsilon \quad \forall x \in [0,1], \]
if $n \ge N_0$. Deduce that
\[ \|f - p_n\|_\infty < \varepsilon, \qquad \text{if } n \ge N_0, \]
where $\|\cdot\|_\infty$ denotes the $\infty$-norm on the interval $[0,1]$.
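
The identities and the convergence statement in the last exercise can be explored numerically before attempting the proof; this is an experiment, not a solution. The Python sketch below assumes the standard form of the Bernstein polynomials, $p_{nk}(x) = \binom{n}{k} x^k (1-x)^{n-k}$; the helper names and the test function $f(x) = |x - \tfrac{1}{2}|$ are illustrative choices.

```python
import math

def bernstein_basis(n, k, x):
    """p_{nk}(x) = C(n, k) x^k (1 - x)^(n - k)."""
    return math.comb(n, k) * x ** k * (1.0 - x) ** (n - k)

def bernstein_approx(f, n, x):
    """p_n(x) = sum over k of f(k/n) p_{nk}(x)."""
    return sum(f(k / n) * bernstein_basis(n, k, x) for k in range(n + 1))

# Check the three summation identities at one sample point.
n, x = 12, 0.3
s0 = sum(bernstein_basis(n, k, x) for k in range(n + 1))
s1 = sum(k * bernstein_basis(n, k, x) for k in range(n + 1))
s2 = sum((x - k / n) ** 2 * bernstein_basis(n, k, x) for k in range(n + 1))
print(s0, s1, n * x, s2, x * (1.0 - x) / n)

# Observe the (slow) uniform convergence for a continuous function.
def f(t):
    return abs(t - 0.5)

def max_error(m):
    return max(abs(f(i / 100) - bernstein_approx(f, m, i / 100)) for i in range(101))

for m in (10, 50, 200):
    print(m, max_error(m))
```

The printed maximum errors decrease as the degree grows, but only slowly, which is consistent with the size of $N_0$ required in the exercise.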
Approximation in the 2-norm


9.1 Introduction 


In Chapter 8 we discussed the idea of best approximation of a continuous
real-valued function by polynomials of some fixed degree in the $\infty$-norm.
Here we consider the analogous problem of best approximation in the
2-norm. Why, you might ask, is it necessary to consider best approximation
in the 2-norm when we have already developed a perfectly adequate
theory of best approximation in the $\infty$-norm? As our first example in
Section 9.3 will demonstrate, the choice of norm can significantly influence
the outcome of the problem of best approximation: the polynomial
of best approximation of a certain fixed degree to a given continuous
function in one norm need not bear any resemblance to the polynomial
of best approximation of the same degree in another norm. Ultimately,
in a practical situation, the choice of norm will be governed by the sense
in which the given continuous function has to be well approximated.

As will become apparent, best approximation in the 2-norm is closely 
related to the notion of orthogonality and this in turn relies on the 
concept of inner product. Thus, we begin the chapter by recalling from 
linear algebra the definition of inner product space. 

Throughout the chapter [a, b] will denote a nonempty, bounded, closed 
interval of the real line, and (a, b) will signify a nonempty bounded open 
interval of the real line. 
9.2 Inner product spaces


Definition 9.1 Let $\mathcal{V}$ be a linear space over the field of real numbers.
A real-valued function $(\cdot,\cdot)$, defined on the Cartesian product $\mathcal{V} \times \mathcal{V}$, is
called an inner product on $\mathcal{V}$ if it satisfies the following axioms:

① $(f + g, h) = (f, h) + (g, h)$ for all $f$, $g$ and $h$ in $\mathcal{V}$;

② $(\lambda f, g) = \lambda (f, g)$ for all $\lambda$ in $\mathbb{R}$, and all $f$, $g$ in $\mathcal{V}$;

③ $(f, g) = (g, f)$ for all $f$ and $g$ in $\mathcal{V}$;

④ $(f, f) > 0$ if $f \ne 0$, $f \in \mathcal{V}$.


A linear space with an inner product is called an inner product space.


Example 9.1 The $n$-dimensional Euclidean space $\mathbb{R}^n$ is an inner product
space with
\[ (x, y) = \sum_{i=1}^{n} x_i y_i, \qquad x, y \in \mathbb{R}^n, \]
where $x = (x_1, \ldots, x_n)^{\mathrm{T}}$ and $y = (y_1, \ldots, y_n)^{\mathrm{T}}$. We can also write this
in a more compact form as $(x, y) = x^{\mathrm{T}} y$.


Definition 9.2 Suppose that V is an inner product space, and f and g 
are two elements of V such that (f,g) = 0; we shall then say that f is 
orthogonal to g. 


Due to the third axiom of inner product, if f is orthogonal to g, then 
g is orthogonal to f; therefore, if (f,g) = 0, we shall simply say that f 
and g are orthogonal. Our next example shows that Definition 9.2 is a 
direct generalisation of the usual geometrical notion of orthogonality. 


Example 9.2 According to Example 9.1, with $n = 2$, the formula
$(y, z) = y^{\mathrm T} z$, where $y = (y_1, y_2)^{\mathrm T}$ and
$z = (z_1, z_2)^{\mathrm T}$ are two-component vectors, defines an inner
product in $\mathbb{R}^2$.

The vectors $y$ and $z$ have respective lengths
$\sqrt{y_1^2 + y_2^2} = \|y\|_2$ and $\sqrt{z_1^2 + z_2^2} = \|z\|_2$,
where $\|\cdot\|_2$ denotes the 2-norm for vectors in $\mathbb{R}^2$.
Let $\alpha \in [0, 2\pi)$ denote the angle, measured in an anticlockwise
direction, between the positive $x_1$-coordinate direction and $y$;
similarly, let $\beta \in [0, 2\pi)$ be the angle between the positive
$x_1$-coordinate direction and $z$. Then,

\[ y = \|y\|_2 (\cos\alpha, \sin\alpha)^{\mathrm T} \quad\text{and}\quad z = \|z\|_2 (\cos\beta, \sin\beta)^{\mathrm T}. \]

Now,

\[ (y, z) = y^{\mathrm T} z
          = \|y\|_2 \|z\|_2 (\cos\alpha\cos\beta + \sin\alpha\sin\beta)
          = \|y\|_2 \|z\|_2 \cos(\alpha - \beta)
          = \|y\|_2 \|z\|_2 \cos(\vartheta_{yz}), \]

where $\vartheta_{yz} = |\alpha - \beta|$ is the angle between the vectors
$y$ and $z$. The vector $y$ is orthogonal to $z$ if, and only if,
$\vartheta_{yz}$ is $\pi/2$ or $3\pi/2$; either way,
$\cos(\vartheta_{yz}) = 0$, and hence $(y, z) = 0$. We note in passing
that if $y = z$, then $\vartheta_{yz} = 0$ and therefore

\[ (y, y) = \|y\|_2^2. \]


This last observation motivates our next definition. 


Definition 9.3 Suppose that $V$ is an inner product space over the field
of real numbers, with inner product $(\cdot,\cdot)$. For $f$ in $V$, we
define the induced norm

\[ \|f\| = (f, f)^{1/2}. \tag{9.1} \]


Although our terminology and our notation appear to imply that (9.1)
defines a norm on $V$, this is by no means obvious. In order to show that
$f \mapsto (f, f)^{1/2}$ is indeed a norm, we begin with the following
result which is a direct generalisation of the Cauchy–Schwarz inequality
(2.35) from Chapter 2.


Lemma 9.1 (Cauchy–Schwarz inequality)

\[ |(f, g)| \le \|f\|\,\|g\| \qquad \forall f, g \in V. \tag{9.2} \]


Proof The proof is analogous to that of (2.35). Recalling the definition
of $\|\cdot\|$ from (9.1) and noting the first three axioms of inner
product, we find that, for $f, g \in V$,

\[ 0 \le \|\lambda f + g\|^2 = \lambda^2 \|f\|^2 + 2\lambda (f, g) + \|g\|^2 \qquad \forall \lambda \in \mathbb{R}. \tag{9.3} \]

Denoting, for $f, g \in V$ fixed, the quadratic polynomial in $\lambda$ on
the right-hand side by $A(\lambda)$, the condition for $A(\lambda)$ to be
nonnegative for all $\lambda$ in $\mathbb{R}$ is that
$[2(f, g)]^2 - 4\|f\|^2 \|g\|^2 \le 0$; this gives the inequality (9.2).


Now, putting $\lambda = 1$ in (9.3) and using (9.2) on the right yields
$\|f + g\|^2 \le \|f\|^2 + 2\|f\|\,\|g\| + \|g\|^2 = (\|f\| + \|g\|)^2$,
that is,

\[ \|f + g\| \le \|f\| + \|g\| \qquad \forall f, g \in V. \]

Consequently, $\|\cdot\|$ obeys the triangle inequality, the third axiom of
norm. The first two axioms of norm, namely that

• $\|f\| \ge 0$ for all $f \in V$, and $\|f\| = 0$ if, and only if, $f = 0$ in $V$, and
• $\|\lambda f\| = |\lambda|\,\|f\|$ for all $\lambda \in \mathbb{R}$ and all $f \in V$,

follow directly from (9.1) and from the last three axioms of inner product
stated in Definition 9.1.
We have thus shown the following result. 


Theorem 9.1 An inner product space $V$ over the field $\mathbb{R}$ of real
numbers, equipped with the induced norm $\|\cdot\|$, is a normed linear
space over $\mathbb{R}$.


We conclude this section with a relevant example of an inner product 
space, whose induced norm is the 2-norm considered at the beginning of 
Chapter 8. 


Example 9.3 The set $C[a, b]$ of continuous real-valued functions defined
on the closed interval $[a, b]$ is an inner product space with

\[ (f, g) = \int_a^b w(x) f(x) g(x)\,dx, \tag{9.4} \]

where $w$ is a weight function, defined, positive, continuous and
integrable on the open interval $(a, b)$. The norm $\|\cdot\|_2$, induced
by this inner product and given by

\[ \|f\|_2 = \left( \int_a^b w(x) |f(x)|^2\,dx \right)^{1/2}, \tag{9.5} \]

is referred to as the 2-norm on $C[a, b]$ (see Example 8.2). For the sake
of simplicity, we have chosen not to distinguish in terms of our notation
between the 2-norm on $C[a, b]$ defined above and the 2-norm for vectors
introduced in Chapter 2; it will always be clear from the context which
of the two is intended.
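
The inner product (9.4) and the norm (9.5) are easy to experiment with
numerically. The following is a minimal sketch, not part of the original
text: the helper names inner and norm2 are our own, the integral is
approximated by the composite Simpson rule, and the weight function is
assumed to be finite at the end-points (which holds for w(x) = 1, the
default used here). It checks the Cauchy–Schwarz and triangle
inequalities for one pair of functions on (0, 1).

```python
import math


def inner(f, g, a, b, w=lambda x: 1.0, n=2000):
    """Approximate (f, g) = int_a^b w(x) f(x) g(x) dx by composite Simpson.

    n must be even; w is assumed finite at a and b in this simple sketch.
    """
    h = (b - a) / n
    total = w(a) * f(a) * g(a) + w(b) * f(b) * g(b)
    for i in range(1, n):
        x = a + i * h
        total += (4 if i % 2 else 2) * w(x) * f(x) * g(x)
    return total * h / 3


def norm2(f, a, b, w=lambda x: 1.0):
    """Induced 2-norm (9.5): the square root of (f, f)."""
    return math.sqrt(inner(f, f, a, b, w))


a, b = 0.0, 1.0
f, g = math.sin, math.exp

lhs = abs(inner(f, g, a, b))                       # |(f, g)|
rhs = norm2(f, a, b) * norm2(g, a, b)              # ||f|| ||g||
sum_norm = norm2(lambda x: f(x) + g(x), a, b)      # ||f + g||

print("Cauchy-Schwarz:", lhs, "<=", rhs)
print("triangle:      ", sum_norm, "<=", norm2(f, a, b) + norm2(g, a, b))
```

A weight function that is unbounded at an end-point, such as the
Chebyshev weight of Example 9.7, would need a different quadrature rule
or a change of variable.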


Clearly, it is not necessary to demand the continuity of the function
$f$ on the closed interval $[a, b]$ to ensure that $\|f\|_2$ is finite.
For example, $f\colon x \mapsto \mathrm{sgn}\,(x - \tfrac12(a+b))$,
$x \in [a, b]$, has finite 2-norm, despite the fact that it has a jump
discontinuity at $x = \tfrac12(a+b)$.

Motivated by this observation, and the desire to develop a theory
of approximation in the 2-norm whose range of applicability extends
beyond the linear space of continuous functions on a bounded closed
interval, we denote by $L^2_w(a, b)$ the set of all real-valued functions
$f$ defined on $(a, b)$ such that $w(x)|f(x)|^2$ is integrable¹ on
$(a, b)$; the set $L^2_w(a, b)$ is equipped with the inner product (9.4)
and the induced 2-norm (9.5). Obviously, $C[a, b]$ is a proper subset of
$L^2_w(a, b)$.

In this broader context, $\|\cdot\|_2$ is frequently referred to as the
$L^2$-norm; for the sake of simplicity we shall continue to call it the
2-norm. As before, $w$ is assumed to be a real-valued function, defined,
positive, continuous and integrable on the open interval $(a, b)$. When
$w(x) \equiv 1$ on $(a, b)$, we shall write $L^2(a, b)$ instead of
$L^2_w(a, b)$.

We are now ready to consider best approximation in the 2-norm. 


9.3 Best approximation in the 2-norm 


The problem of best approximation in the 2-norm can be formulated as 
follows: 


(B) Given that $f \in L^2_w(a, b)$, find $p_n \in \mathcal{P}_n$ such that

\[ \|f - p_n\|_2 = \inf_{q \in \mathcal{P}_n} \|f - q\|_2; \]

such $p_n$ is called a polynomial of best approximation of degree $n$
to the function $f$ in the 2-norm on $(a, b)$.


The existence and uniqueness of $p_n$ will be shown in Theorem 9.2.
However, we shall first consider some simple examples. 


¹ Strictly speaking, the integral in the definition of $\|\cdot\|_2$ should
now be thought of as a Lebesgue integral, with the convention that any two
functions in $L^2_w(a, b)$ which differ only on a set of zero measure are
identified. Readers who are unfamiliar with the concept of Lebesgue
integral can safely ignore this footnote. For the definition of set of
measure zero see Section 11.1 in Chapter 11.


Example 9.4 Suppose that $\varepsilon > 0$ and let
$f(x) = 1 - e^{-x/\varepsilon}$ with $x$ in $[0, 1]$. For
$\varepsilon = 10^{-2}$, the function $f$ is depicted in Figure 9.1. We
shall construct the polynomial of best approximation of degree 0 in the
2-norm, with weight function $w(x) \equiv 1$, for $f$ on $(0, 1)$, and
compare it with the minimax polynomial of degree 0 for $f$ on $[0, 1]$.


Fig. 9.1. Graph of the function $f\colon x \mapsto 1 - e^{-x/\varepsilon}$
for $x \in [0, 1]$ and $\varepsilon = 10^{-2}$.


The best approximation to $f$ by a polynomial of degree 0 in the 2-norm
on the interval $(0, 1)$, with weight function $w(x) \equiv 1$, is
determined by minimising $\|f - c\|_2$ over all $c \in \mathbb{R}$;
equivalently, we need to minimise

\[ \int_0^1 |f(x) - c|^2\,dx = \int_0^1 |f(x)|^2\,dx - 2c \int_0^1 f(x)\,dx + c^2 \]

over all $c \in \mathbb{R}$. The right-hand side is a quadratic polynomial
in $c$; its minimum, as a function of $c$, is achieved for

\[ c = \int_0^1 f(x)\,dx = 1 - \varepsilon + \varepsilon e^{-1/\varepsilon}. \]

Consequently, the polynomial of degree 0 of best approximation to $f$
in the 2-norm on the interval $(0, 1)$ with respect to the weight function
$w(x) \equiv 1$ is

\[ p_0^{(2\text{-norm})}(x) \equiv 1 - \varepsilon + \varepsilon e^{-1/\varepsilon}, \qquad x \in [0, 1]. \]

On the other hand, since $f \in C[0, 1]$ and $f$ is strictly monotonic
increasing on $[0, 1]$, its minimax approximation of degree 0 on the
interval $[0, 1]$ is simply the arithmetic mean of $f(0)$ and $f(1)$:

\[ p_0^{(\infty\text{-norm})}(x) \equiv \tfrac12 \left( 1 - e^{-1/\varepsilon} \right), \qquad x \in [0, 1]. \]

Clearly, for $0 < \varepsilon \ll 1$,
$p_0^{(\infty\text{-norm})}(x) \approx 1/2$, while
$p_0^{(2\text{-norm})}(x) \approx 1$.

An even more dramatic discrepancy is observed between the polynomials of
best approximation in the 2-norm and the ∞-norm when

\[ f(x) = 1 - \varepsilon^{-1/2} e^{-x/\varepsilon}, \qquad x \in [0, 1]. \]

Here, for $0 < \varepsilon \ll 1$, $p_0^{(2\text{-norm})}(x) \approx 1$,
as before. On the other hand,

\[ p_0^{(\infty\text{-norm})}(x) = 1 - \tfrac12 \varepsilon^{-1/2} \left( 1 + e^{-1/\varepsilon} \right), \qquad x \in [0, 1], \]

which tends to $-\infty$ as $\varepsilon \to 0+$. These examples indicate
that the polynomial of best approximation from $\mathcal{P}_n$ to a
function in the 2-norm can be vastly different from the minimax
approximation from $\mathcal{P}_n$ to the same function. ◊
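
The two closed-form constants in Example 9.4 can be tabulated for several
values of ε to see the discrepancy develop. The sketch below is only an
illustration: the function names best_constants and best_constants_scaled
are hypothetical, and the formulae coded are the mean value of f (for the
2-norm) and the arithmetic mean of f(0) and f(1) (for the ∞-norm), as
derived in the example.

```python
import math


def best_constants(eps):
    """Degree-0 best approximations to f(x) = 1 - exp(-x/eps) on [0, 1].

    Returns (c2, cinf): the 2-norm constant (mean value of f, w = 1)
    and the minimax constant (f(0) + f(1)) / 2.
    """
    c2 = 1.0 - eps + eps * math.exp(-1.0 / eps)
    cinf = 0.5 * (1.0 - math.exp(-1.0 / eps))
    return c2, cinf


def best_constants_scaled(eps):
    """The same two constants for f(x) = 1 - eps**(-1/2) * exp(-x/eps)."""
    c2 = 1.0 - math.sqrt(eps) * (1.0 - math.exp(-1.0 / eps))
    cinf = 1.0 - 0.5 * eps ** -0.5 * (1.0 + math.exp(-1.0 / eps))
    return c2, cinf


for eps in (1e-1, 1e-2, 1e-4):
    print(eps, best_constants(eps), best_constants_scaled(eps))
```

As ε decreases, the 2-norm constants approach 1 in both cases, the first
minimax constant approaches 1/2, and the second decreases without bound.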


Given $f \in L^2_w(a, b)$, we shall assume for the moment the existence of
an associated polynomial of best approximation in the 2-norm; later on
we shall prove that such a polynomial exists and is unique. In order to
motivate the general discussion that will follow, it is helpful to begin
with a straightforward approach to a simple example.

Let us suppose that we wish to construct the polynomial of best
approximation $p_n \in \mathcal{P}_n$, $n \ge 0$, to a function
$f \in L^2_w(0, 1)$ on the interval $(0, 1)$ in the 2-norm; for
simplicity, we shall assume that the weight function $w(x) \equiv 1$.
Writing the polynomial $p_n$ as

\[ p_n(x) = c_0 + c_1 x + \cdots + c_n x^n, \]

we want to choose the coefficients $c_j$, $j = 0, \ldots, n$, so as to
minimise the 2-norm of the error, $e_n = f - p_n$,

\[ \|e_n\|_2 = \|f - p_n\|_2 = \left( \int_0^1 |f(x) - p_n(x)|^2\,dx \right)^{1/2}. \]

Since the 2-norm is nonnegative and the function
$\xi \in \mathbb{R}_+ \mapsto \xi^{1/2}$ is monotonic increasing, this
problem is equivalent to one of minimising the square of the norm; thus,
instead, we shall minimise the expression

\[ E(c_0, c_1, \ldots, c_n) = \int_0^1 [f(x) - p_n(x)]^2\,dx \]

by treating it as a function of $(c_0, \ldots, c_n)$. At the minimum, the
partial derivatives of $E$ with respect to the $c_j$, $j = 0, \ldots, n$,
are equal to zero. This leads to a system of $(n + 1)$ linear equations
for the coefficients $c_0, \ldots, c_n$:

\[ \sum_{k=0}^{n} M_{jk} c_k = b_j, \qquad j = 0, \ldots, n, \tag{9.6} \]

where

\[ M_{jk} = \int_0^1 x^{k+j}\,dx = \frac{1}{k + j + 1}, \qquad b_j = \int_0^1 f(x)\, x^j\,dx. \]

Equivalently, recalling that the inner product associated with the 2-norm
(in the case of $w(x) \equiv 1$) is defined by

\[ (g, h) = \int_0^1 g(x) h(x)\,dx, \]

$M_{jk}$ and $b_j$ can be written as

\[ M_{jk} = (x^k, x^j), \qquad b_j = (f, x^j). \tag{9.7} \]


By solving the system of linear equations (9.6) for $c_0, \ldots, c_n$,
we obtain the coefficients of the polynomial of best approximation of
degree $n$ to the function $f$ in the 2-norm on the interval $(0, 1)$. We
can proceed in the same manner on any interval $(a, b)$ with any positive,
continuous and integrable weight function $w$ defined on $(a, b)$.

This approach is straightforward for small values of n, but soon be- 
comes impractical as n increases. The source of the computational dif- 
ficulties is the fact that the matrix M is the Hilbert matrix, discussed 
in Section 2.8. The Hilbert matrix is well known to be ill-conditioned 
for large n, so any solution to (9.6), computed with a fixed number of 
decimal digits, loses all accuracy due to accumulation of rounding errors. 
Fortunately, an alternative method is available, and is discussed in the 
next section. 
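
To get a feeling for how quickly the matrix in (9.6) deteriorates, one
can compute its condition number in exact rational arithmetic, so that
the measurement itself is free of rounding error. The following sketch is
our own illustration, with hypothetical helper names; it uses the
∞-norm condition number, that is, the product of the maximum absolute
row sums of M and of its inverse, and the matrix of order m corresponds
to polynomial degree n = m − 1.

```python
from fractions import Fraction


def hilbert(m):
    """Matrix M with M[j][k] = 1/(j + k + 1), j, k = 0..m-1, as in (9.6)."""
    return [[Fraction(1, j + k + 1) for k in range(m)] for j in range(m)]


def inverse(M):
    """Exact inverse by Gauss-Jordan elimination over the rationals."""
    m = len(M)
    # Augment M with the identity matrix.
    A = [row[:] + [Fraction(int(i == j)) for j in range(m)]
         for i, row in enumerate(M)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [row[m:] for row in A]


def norm_inf(M):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def cond_inf(m):
    """Exact infinity-norm condition number of the order-m matrix."""
    H = hilbert(m)
    return norm_inf(H) * norm_inf(inverse(H))


for m in (3, 6, 9, 12):
    print(m, float(cond_inf(m)))
```

Comparing the printed values with the reciprocal of the unit roundoff of
the arithmetic in use gives a rough idea of the degree at which a
computed solution of (9.6) can no longer be trusted.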


9.4 Orthogonal polynomials 


In the previous section we described a method for constructing the
polynomial of best approximation $p_n \in \mathcal{P}_n$ to a function $f$
in the 2-norm; it was based on seeking $p_n$ as a linear combination of
the polynomials $x^j$, $j = 0, \ldots, n$, which form a basis for the
linear space $\mathcal{P}_n$. The approach was not entirely satisfactory
because it gave rise to a system of linear equations with a full matrix
that was difficult to invert. The central idea of the alternative approach
that will be described in this section is to expand $p_n$ in terms of a
different basis, chosen so that the resulting system of linear equations
has a diagonal matrix; solving this linear system is then a trivial
exercise. Of course, the nontrivial ingredient of this alternative
approach is to find a suitable basis for $\mathcal{P}_n$ which achieves
the objective that the matrix of the linear system is diagonal. The
expression for $M_{jk}$ in (9.7) gives us a clue how to proceed.

Suppose that $\varphi_j$, $j = 0, \ldots, n$, form a basis for
$\mathcal{P}_n$, $n \ge 0$; let us seek the polynomial of best
approximation as

\[ p_n(x) = \gamma_0 \varphi_0(x) + \cdots + \gamma_n \varphi_n(x), \]

where $\gamma_0, \ldots, \gamma_n$ are real numbers to be determined. By
the same process as in the previous section, we arrive at a system of
linear equations of the form (9.6):

\[ \sum_{k=0}^{n} M_{jk} \gamma_k = \beta_j, \qquad j = 0, \ldots, n, \]

where now

\[ M_{jk} = (\varphi_k, \varphi_j) \quad\text{and}\quad \beta_j = (f, \varphi_j), \]

with the inner product $(\cdot,\cdot)$ defined by

\[ (g, h) = \int_a^b w(x) g(x) h(x)\,dx, \]

and the weight function $w$ assumed to be positive, continuous and
integrable on the interval $(a, b)$.

Thus, $M = (M_{jk})$ will be a diagonal matrix provided that the basis
functions $\varphi_j$, $j = 0, \ldots, n$, for the linear space
$\mathcal{P}_n$ are chosen so that $(\varphi_k, \varphi_j) = 0$ for
$j \ne k$; in other words, $\varphi_k$ is required to be orthogonal to
$\varphi_j$ for $j \ne k$, in the sense of Definition 9.2. This
observation motivates the following definition.


Definition 9.4 Given a weight function $w$, defined, positive, continuous
and integrable on the interval $(a, b)$, we say that the sequence of
polynomials $\varphi_j$, $j = 0, 1, \ldots$, is a system of orthogonal
polynomials on the interval $(a, b)$ with respect to $w$, if each
$\varphi_j$ is of exact degree $j$, and if

\[ \int_a^b w(x) \varphi_k(x) \varphi_j(x)\,dx
   \begin{cases} = 0 & \text{for all } k \ne j, \\ \ne 0 & \text{when } k = j. \end{cases} \]


Next, we show that a system of orthogonal polynomials exists on any
interval $(a, b)$ and for any weight function $w$ which satisfies the
conditions in Definition 9.4. We proceed inductively.

Let $\varphi_0(x) \equiv 1$, and suppose that $\varphi_j$ has already been
constructed for $j = 0, \ldots, n$, with $n \ge 0$. Then,

\[ \int_a^b w(x) \varphi_k(x) \varphi_j(x)\,dx = 0, \qquad k \in \{0, \ldots, n\} \setminus \{j\}. \]

Let us now define the polynomial

\[ q(x) = x^{n+1} - a_0 \varphi_0(x) - \cdots - a_n \varphi_n(x), \]

where

\[ a_j = \frac{\int_a^b w(x)\, x^{n+1} \varphi_j(x)\,dx}{\int_a^b w(x) [\varphi_j(x)]^2\,dx}, \qquad j = 0, \ldots, n. \]

It then follows that

\[ \int_a^b w(x) q(x) \varphi_j(x)\,dx
   = \int_a^b w(x)\, x^{n+1} \varphi_j(x)\,dx - a_j \int_a^b w(x) [\varphi_j(x)]^2\,dx
   = 0 \qquad \text{for } 0 \le j \le n, \]

where we have used the orthogonality of the sequence $\varphi_j$,
$j = 0, \ldots, n$. Thus, with this choice of the numbers $a_j$ we have
ensured that $q$ is orthogonal to all the previous members of the
sequence, and $\varphi_{n+1}$ can now be defined as any nonzero-constant
multiple of $q$. This procedure for constructing a system of orthogonal
polynomials is usually referred to as Gram–Schmidt orthogonalisation.¹
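
The inductive construction just described can be sketched in a few lines
for the special case of the interval (0, 1) with w(x) = 1, where the
inner product of two monomials is available exactly as a rational number.
This is an illustrative sketch under those assumptions only: the names
inner and gram_schmidt are our own, polynomials are stored as lists of
coefficients with the lowest degree first, and each new polynomial is
taken to be q itself (that is, monic).

```python
from fractions import Fraction


def inner(p, q):
    """(p, q) = int_0^1 p(x) q(x) dx for coefficient lists, lowest degree first."""
    return sum(pi * qj * Fraction(1, i + j + 1)
               for i, pi in enumerate(p)
               for j, qj in enumerate(q))


def gram_schmidt(n):
    """Return monic phi_0, ..., phi_n, orthogonal on (0, 1) with w = 1."""
    phis = [[Fraction(1)]]                       # phi_0(x) = 1
    for deg in range(1, n + 1):
        monomial = [Fraction(0)] * deg + [Fraction(1)]   # x**deg
        q = monomial[:]
        for phi in phis:
            # a_j = (x**deg, phi_j) / (phi_j, phi_j)
            a = inner(monomial, phi) / inner(phi, phi)
            for i, c in enumerate(phi):
                q[i] -= a * c
        phis.append(q)
    return phis


for phi in gram_schmidt(3):
    print([str(c) for c in phi])
```

For other intervals or weight functions only the function inner needs to
change; it would then typically be evaluated by numerical quadrature
rather than exactly.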


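¹ Jørgen Pedersen Gram (27 June 1850, Nustrup, Denmark – 29 April 1916,
Copenhagen, Denmark); Erhard Schmidt (13 January 1876, Dorpat, Russia (now
Tartu, Estonia) – 6 December 1959, Berlin, Germany).


Example 9.5 We shall construct a system of orthogonal polynomials
$\{\varphi_0, \varphi_1, \varphi_2\}$ on the interval $(0, 1)$ with
respect to the weight function $w(x) \equiv 1$.


We put $\varphi_0(x) \equiv 1$, and we seek $\varphi_1$ in the form

\[ \varphi_1(x) = x - c_0 \varphi_0(x) \]

such that $(\varphi_1, \varphi_0) = 0$; that is,

\[ (x, \varphi_0) - c_0 (\varphi_0, \varphi_0) = 0. \]

Hence,

\[ c_0 = \frac{(x, \varphi_0)}{(\varphi_0, \varphi_0)} = \frac12, \]

and therefore,

\[ \varphi_1(x) = x - \tfrac12. \]

By construction, $(\varphi_1, \varphi_0) = (\varphi_0, \varphi_1) = 0$.
We now seek $\varphi_2$ in the form

\[ \varphi_2(x) = x^2 - (d_1 \varphi_1(x) + d_0 \varphi_0(x)) \]

such that $(\varphi_2, \varphi_1) = 0$ and $(\varphi_2, \varphi_0) = 0$.
Thus,

\[ (x^2, \varphi_1) - d_1 (\varphi_1, \varphi_1) - d_0 (\varphi_0, \varphi_1) = 0, \]
\[ (x^2, \varphi_0) - d_1 (\varphi_1, \varphi_0) - d_0 (\varphi_0, \varphi_0) = 0. \]

As $(\varphi_0, \varphi_1) = 0$ and $(\varphi_1, \varphi_0) = 0$, we have
that

\[ d_1 = \frac{(x^2, \varphi_1)}{(\varphi_1, \varphi_1)} = 1, \qquad
   d_0 = \frac{(x^2, \varphi_0)}{(\varphi_0, \varphi_0)} = \frac13, \]

and therefore

\[ \varphi_2(x) = x^2 - x + \tfrac16. \tag{9.8} \]


Clearly, $(\varphi_k, \varphi_j) = 0$ for $j \ne k$,
$j, k \in \{0, 1, 2\}$, and $\varphi_j$ is of exact degree $j$,
$j = 0, 1, 2$. Thus we have found the required system
$\{\varphi_0, \varphi_1, \varphi_2\}$ of orthogonal polynomials on the
interval $(0, 1)$ with respect to the given weight function $w$.

By continuing this procedure, we can construct a system of orthogonal
polynomials $\{\varphi_0, \varphi_1, \ldots, \varphi_n\}$, with respect to
the weight function $w(x) \equiv 1$ on the interval $(0, 1)$, for any
$n \ge 1$. For example, when $n = 3$, we shall find
$\{\varphi_0, \varphi_1, \varphi_2, \varphi_3\}$, with $\varphi_0$,
$\varphi_1$, $\varphi_2$ as above, and

\[ \varphi_3(x) = x^3 - \tfrac32 x^2 + \tfrac35 x - \tfrac{1}{20}. \]

◊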


Having generated a system of orthogonal polynomials on the interval
$(0, 1)$ with respect to the weight function $w(x) \equiv 1$, by
performing the linear mapping $x \mapsto (b - a)x + a$ we may obtain a
system of orthogonal polynomials on any open interval $(a, b)$ with
respect to the weight function $w(x) \equiv 1$. For example, when
$(a, b) = (-1, 1)$, the mapping $x \mapsto 2x - 1$ leads to the system of
Legendre polynomials on $(-1, 1)$.


Fig. 9.2. The first four Legendre polynomials on the interval $(-1, 1)$.


Example 9.6 (Legendre polynomials) We wish to construct a system of
orthogonal polynomials on $(a, b) = (-1, 1)$ with respect to the weight
function $w(x) \equiv 1$.

On replacing $x$ by

\[ \frac{x - a}{b - a} = \tfrac12 (x + 1), \qquad x \in (a, b) = (-1, 1), \]

in $\varphi_0(x)$, $\varphi_1(x)$, $\varphi_2(x)$, $\varphi_3(x)$ from
Example 9.5, we obtain, on normalising each of these polynomials so that
its value at $x = 1$ is equal to 1, the polynomials $\varphi_0$,
$\varphi_1$, $\varphi_2$, $\varphi_3$, defined by

\[ \varphi_0(x) \equiv 1, \]
\[ \varphi_1(x) = x, \]
\[ \varphi_2(x) = \tfrac32 x^2 - \tfrac12, \]
\[ \varphi_3(x) = \tfrac52 x^3 - \tfrac32 x. \]

These are the first four elements of the system of Legendre polynomials,
orthogonal on the interval $(-1, 1)$ with respect to the weight function
$w(x) \equiv 1$. They are depicted in Figure 9.2. An alternative
normalisation would have been to divide each $\varphi_j$ by
$\|\varphi_j\|_2$ so as to ensure that the 2-norm of the resulting scaled
polynomial is equal to 1. ◊
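
The substitution and normalisation used in Example 9.6 can be carried out
mechanically. The sketch below is our own illustration with hypothetical
helper names; it takes the four polynomials of Example 9.5 as coefficient
lists (lowest degree first), replaces x by (x + 1)/2 in exact rational
arithmetic, and divides each result by its value at x = 1, which for a
coefficient list is simply the sum of the coefficients.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def compose_affine(p, c0, c1):
    """Coefficients of p(c0 + c1*x), lowest degree first."""
    result = [Fraction(0)] * len(p)
    power = [Fraction(1)]                 # (c0 + c1*x)**k, starting with k = 0
    for a in p:
        for i, c in enumerate(power):
            result[i] += a * c
        power = poly_mul(power, [c0, c1])
    return result


F = Fraction
# phi_0, ..., phi_3 on (0, 1), taken from Example 9.5.
phis01 = [
    [F(1)],
    [F(-1, 2), F(1)],
    [F(1, 6), F(-1), F(1)],
    [F(-1, 20), F(3, 5), F(-3, 2), F(1)],
]

legendre = []
for p in phis01:
    q = compose_affine(p, F(1, 2), F(1, 2))      # x -> (x + 1)/2
    value_at_one = sum(q)                        # q(1)
    legendre.append([c / value_at_one for c in q])

for q in legendre:
    print([str(c) for c in q])
```

The printed coefficient lists can be compared directly with the four
polynomials displayed in the example.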
Example 9.7 The Chebyshev polynomials
$T_n\colon x \mapsto \cos(n \cos^{-1} x)$, $n = 0, 1, \ldots$, introduced
in Section 8.4, form an orthogonal system on the interval $(-1, 1)$ with
respect to the positive, continuous and integrable weight function
$w(x) = (1 - x^2)^{-1/2}$.

The proof of this is simple: let $(\cdot,\cdot)$ denote the inner product
in $L^2_w(-1, 1)$ with $w(x) = (1 - x^2)^{-1/2}$. By using the change of
independent variable

\[ t \in (0, \pi) \mapsto x = \cos t \in (-1, 1), \]

we have

\[ (T_m, T_n) = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} \cos(m \cos^{-1} x) \cos(n \cos^{-1} x)\,dx \]
\[ \qquad = \int_0^{\pi} \cos mt \cos nt\,dt \]
\[ \qquad = \tfrac12 \int_0^{\pi} \{ \cos(m + n)t + \cos(m - n)t \}\,dt \]
\[ \qquad = \begin{cases} 0 & \text{when } m \ne n, \\ \pi/2 & \text{when } m = n \ge 1, \end{cases} \]

for any pair of nonnegative integers $m$ and $n$; in the remaining case
$m = n = 0$ the integral is equal to $\pi$, which is again nonzero. ◊
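
The orthogonality relation of Example 9.7 is easily checked numerically
after the substitution x = cos t, which removes the end-point singularity
of the weight function. The following sketch is an illustration only: the
name cheb_inner and the number of quadrature points are our own choices,
and the integral over (0, π) is approximated by the midpoint rule.

```python
import math


def cheb_inner(m, n, N=400):
    """Midpoint-rule approximation of int_0^pi cos(m t) cos(n t) dt.

    After the substitution x = cos t this is the inner product (T_m, T_n)
    with weight (1 - x**2)**(-1/2) on (-1, 1).
    """
    h = math.pi / N
    return h * sum(math.cos(m * (i + 0.5) * h) * math.cos(n * (i + 0.5) * h)
                   for i in range(N))


for m in range(4):
    print([round(cheb_inner(m, n), 12) for n in range(4)])
```

The printed array should be diagonal up to rounding error, with the first
diagonal entry twice as large as the others.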


We are now ready to prove the existence and uniqueness of the polynomial
of best approximation in the 2-norm. In particular, the next theorem shows
that the infimum of $\|f - q\|_2$ over $q \in \mathcal{P}_n$ in problem
(B) is attained and can be replaced by a minimum over
$q \in \mathcal{P}_n$.


Theorem 9.2 Given that $f \in L^2_w(a, b)$, there exists a unique
polynomial $p_n \in \mathcal{P}_n$ such that
$\|f - p_n\|_2 = \min_{q \in \mathcal{P}_n} \|f - q\|_2$.


Proof In order to simplify the notation, we recall the definition of the
inner product $(\cdot,\cdot)$:

\[ (g, h) = \int_a^b w(x) g(x) h(x)\,dx, \]

and note that the induced 2-norm, $\|\cdot\|_2$, is defined by

\[ \|g\|_2 = (g, g)^{1/2}. \]

Suppose that $\varphi_j$, $j = 0, \ldots, n$, is a system of orthogonal
polynomials with respect to the weight function $w$ on $(a, b)$. Let us
normalise the polynomials $\varphi_j$ by defining a new system of
orthogonal polynomials,

\[ \psi_j(x) = \frac{\varphi_j(x)}{\|\varphi_j\|_2}, \qquad j = 0, \ldots, n. \]

Then,

\[ (\psi_j, \psi_k) = \begin{cases} 1, & j = k, \\ 0, & j \ne k. \end{cases} \]

Such a system of polynomials is said to be orthonormal. The polynomials
$\psi_j$, $j = 0, \ldots, n$, are linearly independent and form a basis
for the linear space $\mathcal{P}_n$; therefore, each element
$q \in \mathcal{P}_n$ can be expressed as a suitable linear combination,

\[ q(x) = \beta_0 \psi_0(x) + \cdots + \beta_n \psi_n(x). \]

We wish to choose $\beta_j$, $j = 0, \ldots, n$, so as to ensure that the
corresponding polynomial $q$ minimises $\|f - q\|_2^2$ over all
$q \in \mathcal{P}_n$. Let us, therefore, consider the function
$E\colon (\beta_0, \ldots, \beta_n) \in \mathbb{R}^{n+1} \mapsto E(\beta_0, \ldots, \beta_n)$
defined by $E(\beta_0, \ldots, \beta_n) = \|f - q\|_2^2$, where
$q(x) = \beta_0 \psi_0(x) + \cdots + \beta_n \psi_n(x)$. Then,

\[ E(\beta_0, \ldots, \beta_n) = (f - q, f - q) \]
\[ \qquad = \|f\|_2^2 - 2 \sum_{j=0}^{n} \beta_j (f, \psi_j) + \sum_{j=0}^{n} \sum_{k=0}^{n} \beta_j \beta_k (\psi_k, \psi_j) \]
\[ \qquad = \|f\|_2^2 - 2 \sum_{j=0}^{n} \beta_j (f, \psi_j) + \sum_{j=0}^{n} \beta_j^2 \]
\[ \qquad = \sum_{j=0}^{n} [\beta_j - (f, \psi_j)]^2 + \|f\|_2^2 - \sum_{j=0}^{n} |(f, \psi_j)|^2. \]

The function $(\beta_0, \ldots, \beta_n) \mapsto E(\beta_0, \ldots, \beta_n)$
achieves its minimum value at $(\beta_0^*, \ldots, \beta_n^*)$, where

\[ \beta_j^* = (f, \psi_j), \qquad j = 0, \ldots, n. \]

Hence $p_n \in \mathcal{P}_n$ defined by

\[ p_n(x) = \beta_0^* \psi_0(x) + \cdots + \beta_n^* \psi_n(x) \]

is the unique polynomial of best approximation of degree $n$ to the
function $f \in L^2_w(a, b)$ in the 2-norm on the interval $(a, b)$.

Remark 9.1 As $E(\beta_0^*, \ldots, \beta_n^*) = \|f - p_n\|_2^2 \ge 0$, it
follows from the proof of Theorem 9.2 that if $f \in L^2_w(a, b)$, and
$\{\psi_0, \psi_1, \ldots\}$ is an orthonormal system of polynomials in
$L^2_w(a, b)$, then

\[ \sum_{j=0}^{n} |(f, \psi_j)|^2 \le \|f\|_2^2 \]

for each $n \ge 0$. This result is known as Bessel's inequality.¹


Fig. 9.3. Illustration of the orthogonality property $(f - p_n, q) = 0$ for
all $q$ in $\mathcal{P}_n$, expressing the fact that if
$p_n \in \mathcal{P}_n$ is a polynomial of best approximation to
$f \in L^2_w(a, b)$ in the 2-norm, then the error $f - p_n$ is orthogonal,
in $L^2_w(a, b)$, to all elements of the linear space $\mathcal{P}_n$. The
0 in the figure denotes the zero element of the linear space
$\mathcal{P}_n$ (and, simultaneously, that of $L^2_w(a, b)$), namely the
function that is identically zero on the interval $(a, b)$.


The next theorem, in conjunction with the use of orthogonal poly- 
nomials, will be our key tool for constructing the polynomial of best 
approximation in the 2-norm. 


Theorem 9.3 A polynomial $p_n \in \mathcal{P}_n$ is the polynomial of best
approximation of degree $n$ to a function $f \in L^2_w(a, b)$ in the
2-norm if, and only if, the difference $f - p_n$ is orthogonal to every
element of $\mathcal{P}_n$, i.e.,

\[ (f - p_n, q) = 0 \qquad \forall q \in \mathcal{P}_n. \tag{9.9} \]


A geometrical illustration of the property (9.9) is given in Figure 9.3. 


¹ Friedrich Wilhelm Bessel (22 July 1784, Minden, Westphalia, Holy Roman
Empire (now Germany) – 17 March 1846, Königsberg, Prussia (now
Kaliningrad, Russia)).


Proof of theorem Suppose that (9.9) holds. Then,

\[ (f - p_n, p_n - q) = 0 \qquad \forall q \in \mathcal{P}_n, \]

given that $p_n - q \in \mathcal{P}_n$ for each $q$ in $\mathcal{P}_n$.
Therefore,

\[ \|f - p_n\|_2^2 = (f - p_n, f - p_n) \]
\[ \qquad = (f - p_n, f - q) + (f - p_n, q - p_n) \]
\[ \qquad = (f - p_n, f - q) \qquad \forall q \in \mathcal{P}_n. \]

Hence, by the Cauchy–Schwarz inequality (9.2),

\[ \|f - p_n\|_2^2 \le \|f - p_n\|_2 \|f - q\|_2 \qquad \forall q \in \mathcal{P}_n. \]

This implies that

\[ \|f - p_n\|_2 \le \|f - q\|_2 \qquad \forall q \in \mathcal{P}_n. \]

On choosing $q = p_n$ on the right-hand side, equality will hold and
therefore

\[ \|f - p_n\|_2 = \min_{q \in \mathcal{P}_n} \|f - q\|_2. \]

Conversely, suppose that $p_n$ is the polynomial of best approximation
to $f \in L^2_w(a, b)$. We have seen in the proof of Theorem 9.2 that
$p_n$ can be written in terms of the orthonormal polynomials $\psi_k$,
$k = 0, \ldots, n$, as

\[ p_n(x) = \beta_0^* \psi_0(x) + \cdots + \beta_n^* \psi_n(x), \]

where

\[ \beta_k^* = (f, \psi_k), \qquad k = 0, \ldots, n. \tag{9.10} \]

On recalling that $(\psi_k, \psi_j) = \delta_{jk}$,
$j, k \in \{0, \ldots, n\}$, where $\delta_{jk}$ is the Kronecker delta,
we deduce from (9.10) that

\[ (f - p_n, \psi_j) = (f, \psi_j) - \sum_{k=0}^{n} \beta_k^* (\psi_k, \psi_j) \]
\[ \qquad = (f, \psi_j) - \sum_{k=0}^{n} \beta_k^* \delta_{jk} \]
\[ \qquad = (f, \psi_j) - \beta_j^* = 0, \qquad j = 0, \ldots, n. \tag{9.11} \]

Since $\mathcal{P}_n = \mathrm{span}\{\psi_0, \ldots, \psi_n\}$, it
follows from (9.11) that $(f - p_n, q) = 0$ for all
$q \in \mathcal{P}_n$, as required.


An equivalent, but slightly more explicit, form of writing (9.9) is

\[ \int_a^b w(x) [f(x) - p_n(x)]\, q(x)\,dx = 0 \qquad \forall q \in \mathcal{P}_n. \]


Theorem 9.2 provides a simple method for determining the polynomial
of best approximation $p_n \in \mathcal{P}_n$ to a function
$f \in L^2_w(a, b)$ in the 2-norm. First, proceeding as described in the
discussion following Definition 9.4, we construct the system of orthogonal
polynomials $\varphi_j$, $j = 0, \ldots, n$, on the interval $(a, b)$ with
respect to the weight function $w$, if this system is not already known.
We normalise the polynomials $\varphi_j$, $j = 0, \ldots, n$, by setting

\[ \psi_j = \frac{\varphi_j}{\|\varphi_j\|_2}, \qquad j = 0, \ldots, n, \]

to obtain the system of orthonormal polynomials $\psi_j$,
$j = 0, \ldots, n$, on $(a, b)$. We then evaluate the coefficients
$\beta_j^* = (f, \psi_j)$, $j = 0, \ldots, n$, and form
$p_n(x) = \beta_0^* \psi_0(x) + \cdots + \beta_n^* \psi_n(x)$.

We may avoid the necessity of determining the normalised polynomials
$\psi_j$ by writing

\[ p_n(x) = \beta_0^* \psi_0(x) + \cdots + \beta_n^* \psi_n(x) \]
\[ \qquad = \beta_0^* (\varphi_0, \varphi_0)^{-1/2} \varphi_0(x) + \cdots + \beta_n^* (\varphi_n, \varphi_n)^{-1/2} \varphi_n(x) \]
\[ \qquad = \gamma_0 \varphi_0(x) + \cdots + \gamma_n \varphi_n(x), \tag{9.12} \]

where

\[ \gamma_j = \frac{(f, \varphi_j)}{(\varphi_j, \varphi_j)}, \qquad j = 0, \ldots, n. \tag{9.13} \]


Thus, as indicated at the beginning of the section, with this approach
to the construction of the polynomial of best approximation in the
2-norm, we obtain the coefficients $\gamma_j$ explicitly and there is no
need to solve a system of linear equations with a full matrix.


Example 9.8 We shall construct the polynomial of best approximation
of degree 2 in the 2-norm to the function $f\colon x \mapsto e^x$ over
$(0, 1)$ with weight function $w(x) \equiv 1$.


We already know a system of orthogonal polynomials $\varphi_0$,
$\varphi_1$, $\varphi_2$ on this interval from Example 9.5; thus, we seek
$p_2 \in \mathcal{P}_2$ in the form

\[ p_2(x) = \gamma_0 \varphi_0(x) + \gamma_1 \varphi_1(x) + \gamma_2 \varphi_2(x), \tag{9.14} \]

where, according to (9.13),

\[ \gamma_j = \frac{\int_0^1 e^x \varphi_j(x)\,dx}{\int_0^1 \varphi_j^2(x)\,dx}, \qquad j = 0, 1, 2. \]

Recalling from Example 9.5 that

\[ \varphi_0(x) \equiv 1, \qquad \varphi_1(x) = x - \tfrac12, \qquad \varphi_2(x) = x^2 - x + \tfrac16, \]

we then have that

\[ \gamma_0 = \frac{e - 1}{1} = e - 1, \]
\[ \gamma_1 = \frac{3/2 - e/2}{1/12} = 18 - 6e, \]
\[ \gamma_2 = \frac{7e/6 - 19/6}{1/180} = 210e - 570. \]

Substituting the values of $\gamma_0$, $\gamma_1$ and $\gamma_2$ into
(9.14), we conclude that the polynomial of best approximation of degree 2
for the function $f\colon x \mapsto e^x$ in the 2-norm is

\[ p_2(x) = (210e - 570)x^2 + (588 - 216e)x + (39e - 105). \]

The approximation error is

\[ \|f - p_2\|_2 = 0.005431, \]

to six decimal digits. ◊
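
The coefficients in Example 9.8 can be cross-checked by evaluating the
quotients (9.13) with a quadrature rule instead of by hand. The sketch
below is an illustration under our own assumptions: the helper simpson is
a plain composite Simpson rule, the names gammas, p2 and residuals are
hypothetical, and the last step checks the orthogonality property (9.9) of
Theorem 9.3 against the monomials 1, x and x².

```python
import math


def simpson(F, a, b, n=2000):
    """Composite Simpson approximation of int_a^b F(x) dx (n even)."""
    h = (b - a) / n
    s = F(a) + F(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * F(a + i * h)
    return s * h / 3


# Orthogonal polynomials on (0, 1) with w = 1, from Example 9.5.
phis = [
    lambda x: 1.0,
    lambda x: x - 0.5,
    lambda x: x * x - x + 1.0 / 6.0,
]

# gamma_j = (f, phi_j) / (phi_j, phi_j) with f(x) = exp(x), as in (9.13).
gammas = [
    simpson(lambda x, p=p: math.exp(x) * p(x), 0.0, 1.0)
    / simpson(lambda x, p=p: p(x) ** 2, 0.0, 1.0)
    for p in phis
]


def p2(x):
    """Best approximation of degree 2, assembled as in (9.14)."""
    return sum(g * p(x) for g, p in zip(gammas, phis))


# (f - p2, x**k) for k = 0, 1, 2 should vanish up to quadrature error.
residuals = [
    simpson(lambda x, k=k: (math.exp(x) - p2(x)) * x ** k, 0.0, 1.0)
    for k in range(3)
]

print("gammas:   ", gammas)
print("residuals:", residuals)
```

Since no linear system is solved, each coefficient is obtained
independently of the others, which is the practical advantage emphasised
at the end of the preceding discussion.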


We conclude this section by giving a property of orthogonal polyno- 
mials that will be required in the next chapter. 


Theorem 9.4 Suppose that $\varphi_j$, $j = 0, 1, \ldots$, is a system of
orthogonal polynomials on the interval $(a, b)$ with respect to the
positive, continuous and integrable weight function $w$ on $(a, b)$. It is
understood that $\varphi_j$ is a polynomial of exact degree $j$. Then, for
$j \ge 1$, the zeros of the polynomial $\varphi_j$ are real and distinct,
and lie in the interval $(a, b)$.


Proof Suppose that $\xi_i$, $i = 1, \ldots, k$, are the points in the open
interval $(a, b)$ at which $\varphi_j(x)$ changes sign. Let us note that
$k \ge 1$, because for $j \ge 1$, by orthogonality of $\varphi_j(x)$ to
$\varphi_0(x) \equiv 1$, we have that

\[ \int_a^b w(x) \varphi_j(x)\,dx = 0. \]

Thus, the integrand, being a continuous function that is not identically
zero on $(a, b)$, must change sign on $(a, b)$; however, $w$ is positive
on $(a, b)$, so $\varphi_j$ must change sign at least once on $(a, b)$.
Therefore $k \ge 1$.

Let us define

\[ \pi_k(x) = (x - \xi_1) \cdots (x - \xi_k). \tag{9.16} \]

Now the function $\varphi_j(x) \pi_k(x)$ does not change sign in the
interval $(a, b)$, since at each point where $\varphi_j(x)$ changes sign
$\pi_k(x)$ changes sign also. Hence,

\[ \int_a^b w(x) \varphi_j(x) \pi_k(x)\,dx \ne 0. \]

However, $\varphi_j$ is orthogonal to every polynomial of lower degree
with respect to the weight function $w$, so the degree of the polynomial
$\pi_k$ must be at least $j$; thus, $k \ge j$. On the other hand, $k$
cannot be greater than $j$, since a polynomial of exact degree $j$ cannot
change sign more than $j$ times. Therefore $k = j$; i.e., the points
$\xi_i \in (a, b)$, $i = 1, \ldots, j$, are the zeros (and all the zeros)
of $\varphi_j(x)$.
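
Theorem 9.4 can be observed on the cubic polynomial of Example 9.5, which
is orthogonal on (0, 1) with respect to w(x) = 1. The sketch below is an
illustration only: it brackets the sign changes of the polynomial on a
grid of our own choosing (midpoints of 1000 equal subintervals, so that no
grid point coincides with a zero) and refines each bracket by bisection;
the names phi3, bisect, brackets and zeros are hypothetical.

```python
def phi3(x):
    """phi_3 from Example 9.5: x^3 - (3/2) x^2 + (3/5) x - 1/20."""
    return x ** 3 - 1.5 * x ** 2 + 0.6 * x - 0.05


def bisect(F, lo, hi, tol=1e-14):
    """Locate a sign change of F in [lo, hi] by bisection."""
    flo = F(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (F(mid) > 0) == (flo > 0):
            lo = mid          # same sign as the left end: move left end
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Grid of subinterval midpoints in (0, 1); 0.5 is deliberately not a node.
grid = [(i + 0.5) / 1000 for i in range(1000)]
brackets = [(u, v) for u, v in zip(grid, grid[1:]) if phi3(u) * phi3(v) < 0]
zeros = [bisect(phi3, u, v) for u, v in brackets]

print(zeros)
```

Three sign changes are found, all strictly inside (0, 1), in agreement
with the theorem for j = 3.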


9.5 Comparisons 


We can show that the polynomial of best approximation in the 2-norm
for a function $f \in C[a, b]$ is also a near-best approximation in the
∞-norm for $f$ on $[a, b]$ in the sense defined in Section 8.5.


Theorem 9.5 Let $n \ge 0$ and assume that $f$ is defined and continuous
on the interval $[a, b]$, and $f \notin \mathcal{P}_n$. Let $p_n$ be the
polynomial of best approximation of degree $n$ to $f$ in the 2-norm on
$[a, b]$, where the weight function $w$ is positive, continuous and
integrable on $(a, b)$. Then, the difference $f - p_n$ changes sign at no
less than $n + 1$ distinct points in the interval $(a, b)$.


Proof The proof is very similar to that of Theorem 9.4; we shall give an
outline and leave the details as an exercise.
As $(f - p_n, 1) = 0$, i.e.,

\[ \int_a^b w(x) (f(x) - p_n(x))\,dx = 0, \]

and $w(x) > 0$ for all $x \in (a, b)$, it follows that $f - p_n$ changes
sign in $(a, b)$. Let $\xi_j$, $j = 1, \ldots, k$, denote distinct points
in $(a, b)$ where $f - p_n$ changes sign. We shall prove that
$k \ge n + 1$.

Define the polynomial $\pi_k(x)$ as in (9.16); then,
$w(x) [f(x) - p_n(x)] \pi_k(x)$ does not change sign in $(a, b)$, and so
its integral over $(a, b)$ is not zero. Therefore,
$(f - p_n, \pi_k) \ne 0$. On the other hand, according to Theorem 9.3,
$f - p_n$ is orthogonal to every polynomial of degree $n$ or less. Hence
the degree of $\pi_k(x)$ must be greater than $n$, and so $k \ge n + 1$.


We return to the example illustrated by Figure 8.5, and consider the
difference $f - p_n$ for the function $f\colon x \mapsto e^{2x}$ on the
interval $(0, 1)$. Figure 9.4 shows this difference for two polynomial
approximations of degree 4: the minimax approximation of Section 8.5 and
the best approximation in the 2-norm with weight function
$w(x) \equiv 1$. It is clear that the error of the 2-norm approximation
has the right number of alternating local maxima and minima, and is a
near-minimax approximation from $\mathcal{P}_4$ to $f$ on $[0, 1]$; but
the extrema at the ends of the interval are significantly larger than the
internal extrema. If we use a weight function $w$ which gives greater
weight near the ends of the interval, it seems likely that the extrema of
the error might be more nearly equal. This can be achieved by using the
weight function $w(x) = [x(1 - x)]^{-1/2}$, so that the orthogonal
polynomials are the Chebyshev polynomials adapted to the interval
$(0, 1)$. Figure 9.5 shows the corresponding difference $f - p_n$, and we
now see that the two best approximations, in the ∞-norm and the weighted
2-norm, are very close.


Fig. 9.4. The difference $e^{2x} - p_4(x)$ for two polynomial
approximations of degree 4 on $[0, 1]$. Thin curve: minimax approximation;
thick curve: best approximation in the 2-norm with weight function
$w(x) \equiv 1$.


Fig. 9.5. The difference $e^{2x} - p_4(x)$ for two polynomial
approximations of degree 4 on $[0, 1]$. Thin curve: minimax approximation;
thick curve: best approximation in the 2-norm with weight function
$w(x) = [x(1 - x)]^{-1/2}$.

Polynomials of best approximation in the 2-norm have a special property
which is often useful. Suppose that we have constructed the best
polynomial approximation, $p_n$, of degree $n$, in the 2-norm, but that
$p_n$ does not achieve the required accuracy. To construct the best
polynomial approximation of degree $n + 1$ all we need is to calculate
$\gamma_{n+1}$ from

\[ \gamma_{n+1} = \frac{(f - p_n, \varphi_{n+1})}{\|\varphi_{n+1}\|_2^2} \]

and then let $p_{n+1}(x) = p_n(x) + \gamma_{n+1} \varphi_{n+1}(x)$. By
noting that

\[ (f - p_{n+1}, \varphi_j) = 0, \qquad j = 0, 1, \ldots, n + 1, \]

it follows that $p_{n+1}$ is the best least squares approximation to $f$
from $\mathcal{P}_{n+1}$. If we are constructing the minimax approximation
of degree $n + 1$, or using Lagrange interpolation with equally spaced
points, the work involved in constructing $p_n$ is lost, and the
construction of $p_{n+1}$ must begin completely afresh.
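
The update just described can be followed step by step in exact
arithmetic when f is itself a polynomial. The sketch below is an
illustration under our own assumptions: f(x) = x³ on (0, 1) with
w(x) = 1, the orthogonal polynomials are those of Example 9.5, and the
names inner, add_scaled and history are hypothetical. Polynomials are
stored as coefficient lists with the lowest degree first.

```python
from fractions import Fraction

F = Fraction


def inner(p, q):
    """(p, q) = int_0^1 p(x) q(x) dx for coefficient lists."""
    return sum(pi * qj * F(1, i + j + 1)
               for i, pi in enumerate(p)
               for j, qj in enumerate(q))


def add_scaled(p, g, phi):
    """Return p + g*phi as a coefficient list."""
    out = list(p) + [F(0)] * (len(phi) - len(p))
    for i, c in enumerate(phi):
        out[i] += g * c
    return out


# phi_0, phi_1, phi_2 on (0, 1) with w = 1, from Example 9.5.
phis = [[F(1)], [F(-1, 2), F(1)], [F(1, 6), F(-1), F(1)]]
f = [F(0), F(0), F(0), F(1)]                 # f(x) = x**3

p = []                                       # start from the zero polynomial
history = []
for phi in phis:
    err = add_scaled(f, F(-1), p)            # f - p_n
    gamma = inner(err, phi) / inner(phi, phi)
    p = add_scaled(p, gamma, phi)            # p_{n+1} = p_n + gamma * phi_{n+1}
    history.append(p)

for q in history:
    print([str(c) for c in q])
```

Each pass through the loop reuses the previous approximation unchanged
and adds a single new term, which is exactly the saving described above.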


9.6 Notes 


We give some pointers to the vast literature on orthogonal polynomials. 
The following are classical sources on the subject. 


• GÉZA FREUD, Orthogonal Polynomials, Pergamon Press, Oxford,
New York, 1971.

• PAUL NEVAI, Orthogonal Polynomials, Memoirs of the American
Mathematical Society, no. 213, American Mathematical Society,
Providence, RI, 1979.

• GÁBOR SZEGŐ, Orthogonal Polynomials, Colloquium Publications
(American Mathematical Society), 23, American Mathematical Society,
Providence, RI, 1959.


Tables of orthogonal polynomials are found in

• M. ABRAMOWITZ AND I.A. STEGUN (Editors), 'Orthogonal polynomials',
Ch. 22 in Handbook of Mathematical Functions with Formulas,
Graphs, and Mathematical Tables, ninth printing, Dover, New York,
pp. 771–802, 1972.


Computational aspects of the theory of orthogonal polynomials are
discussed in the edited volume


• W. GAUTSCHI, G.H. GOLUB, AND G. OPFER (Editors), Applications
and Computation of Orthogonal Polynomials, Conference at the
Mathematical Research Institute, Oberwolfach, Germany, March 22–28,
1998, Birkhäuser, Basel, 1999.


A recent survey of the theory and application of orthogonal polynomials
in numerical computations is contained in


• W. GAUTSCHI, Orthogonal polynomials: applications and computation,
Acta Numerica 5 (A. Iserles, ed.), Cambridge University Press,
Cambridge, pp. 45–119, 1996.


Finally, we refer to the books of Powell and Cheney, cited in the 
Notes at the end of the previous chapter, concerning the application of 
orthogonal polynomials in the field of best least squares approximation. 


Exercises 
9.1   Construct orthogonal polynomials of degrees 0, 1 and 2 on the
      interval $(0, 1)$ with the weight function $w(x) = -\ln x$.

9.2   Let the polynomials $\varphi_j$, $j = 0, 1, \ldots$, form an
      orthogonal system on the interval $(-1, 1)$ with respect to the
      weight function $w(x) \equiv 1$. Show that the polynomials
      $\varphi_j((2x - a - b)/(b - a))$, $j = 0, 1, \ldots$, represent an
      orthogonal system for the interval $(a, b)$ and the same weight
      function. Hence obtain the polynomials in Example 9.5 from the
      Legendre polynomials in Example 9.6.

9.3 Suppose that the polynomials y;, 7 = 0,1,..., form an orthog- 
onal system on the interval (0,1) with respect to the weight 
function w(z) = «*, a > 0. Find, in terms of y,;, a system 
of orthogonal polynomials for the interval (0,b) and the same 
weight function. 

9.4 Show, by induction or otherwise, that, for $0 \le k \le n$,

\[
  \left(\frac{d}{dx}\right)^{k} (1 - x^2)^n = (1 - x^2)^{n-k} q_k(x),
\]

where $q_k$ is a polynomial of degree $k$. Deduce that all the
derivatives of the function $(1 - x^2)^n$ of order less than $n$
vanish at $x = \pm 1$.

Define $\varphi_j(x) = (d/dx)^j (1 - x^2)^j$, and show by repeated
integration by parts that

\[
  \int_{-1}^{1} \varphi_k(x) \varphi_j(x)\,dx = 0, \qquad 0 \le k < j.
\]

Hence verify the expressions in Example 9.6 for the Legendre
polynomials of degrees 0, 1, 2 and 3.
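
9.5 Show, by induction or otherwise, that, for $0 \le k \le j$,

\[
  \left(\frac{d}{dx}\right)^{k} \left( x^j \mathrm{e}^{-x} \right)
    = x^{j-k} q_k(x)\, \mathrm{e}^{-x},
\]

where $q_k(x)$ is a polynomial of degree $k$.
The function $\varphi_j$ is defined for $j \ge 0$ by

\[
  \varphi_j(x) = \mathrm{e}^{x} \frac{d^j}{dx^j} \left( x^j \mathrm{e}^{-x} \right).
\]

Show that, for each $j \ge 0$, $\varphi_j$ is a polynomial of degree $j$,
and that these polynomials form an orthogonal system on the
interval $(0,\infty)$ with respect to the weight function
$w(x) = \mathrm{e}^{-x}$. Write down the polynomials with $j = 0, 1, 2$
and 3.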

9.6 Suppose that $\varphi_j$, $j = 0,1,\ldots$, form a system of orthogonal
polynomials with weight function $w(x)$ on the interval $(a,b)$. Show
that, for some value of the constant $C_j$,
$\varphi_{j+1}(x) - C_j x \varphi_j(x)$ is a polynomial of degree $j$,
and hence that

\[
  \varphi_{j+1}(x) - C_j x \varphi_j(x) = \sum_{k=0}^{j} \alpha_{jk} \varphi_k(x),
  \qquad \alpha_{jk} \in \mathbb{R}.
\]

Use the orthogonality properties to show that $\alpha_{jk} = 0$ for
$k < j - 1$, and deduce that the polynomials satisfy a recurrence
relation of the form

\[
  \varphi_{j+1}(x) - (C_j x + D_j) \varphi_j(x) + E_j \varphi_{j-1}(x) = 0,
  \qquad j \ge 1.
\]

9.7 In the notation of Exercise 6 suppose that the normalisation of
the polynomials is so chosen that for each $j$ the coefficient of $x^j$
in $\varphi_j(x)$ is positive. Show that $C_j > 0$ for all $j$. By
considering

\[
  \int_a^b w(x) \varphi_j(x) \left[ \varphi_j(x) - C_{j-1} x \varphi_{j-1}(x) \right] dx
\]

show that

\[
  \int_a^b w(x)\, x\, \varphi_{j-1}(x) \varphi_j(x)\,dx > 0,
\]

and deduce that $E_j > 0$ for all $j$. Hence show that for all
positive values of $j$ the zeros of $\varphi_j$ and $\varphi_{j-1}$
interlace. (See the proof of Theorem 5.8.)

9.8 Using the weight function $w$ on the interval $(a,b)$ apply a similar
argument to that for Theorem 8.6 to find the best polynomial
approximation $p_n$ of degree $n$ in the 2-norm to the function
$x^{n+1}$. Show that

\[
  \| x^{n+1} - p_n \|_2^2
    = \int_a^b w(x) \varphi_{n+1}^2(x)\,dx \Big/ \left[ c_{n+1}^{n+1} \right]^2,
\]

where $c_{n+1}^{n+1}$ is the coefficient of $x^{n+1}$ in $\varphi_{n+1}(x)$.

Write down the best polynomial approximation of degree 2
to the function $x^3$ in the 2-norm with $w(x) \equiv 1$ on the interval
$(-1,1)$, and evaluate the 2-norm of the error.

9.9 Suppose that the weight $w$ is an even function on the interval
$(-a,a)$, and that a system of orthogonal polynomials $\varphi_j$,
$j = 0,\ldots,n$, on the interval $(-a,a)$ is constructed by the
Gram–Schmidt process. Show that, if $j$ is even, then $\varphi_j$ is an
even function, and that, if $j$ is odd, then $\varphi_j$ is an odd
function.

Now suppose that the best polynomial approximation of degree $n$
in the 2-norm to the function $f$ on the interval $(-a,a)$ is
expressed in the form

\[
  p_n(x) = \gamma_0 \varphi_0(x) + \cdots + \gamma_n \varphi_n(x).
\]

Show that if $f$ is an even function, then all the odd coefficients
$\gamma_{2j-1}$ are zero, and that if $f$ is an odd function, then all
the even coefficients $\gamma_{2j}$ are zero.

9.10 The function $H(x)$ is defined by $H(x) = 1$ if $x > 0$, and
$H(-x) = -H(x)$. Construct the best polynomial approximations of
degrees 0, 1 and 2 in the 2-norm to this function over
the interval $(-1,1)$ with weight function $w(x) \equiv 1$. (It may not
appear very useful to consider a polynomial approximation to
a discontinuous function, but representations of such functions
by Fourier series will be familiar to most readers. Note that the
function $H$ belongs to $L^2_w(-1,1)$.)


10 


Numerical integration — II 


10.1 Introduction 


In Section 7.2 we described the Newton–Cotes family of formulae for
numerical integration. These were constructed by replacing the integrand
by its Lagrange interpolation polynomial with equally spaced
interpolation points and integrating this exactly. Here, we consider
another family of numerical integration rules, called Gauss quadrature
formulae, which are based on replacing the integrand $f$ by its Hermite
interpolation polynomial and choosing the interpolation points $x_i$ in
such a way that, after integrating the Hermite polynomial, the derivative
values $f'(x_i)$ do not enter the quadrature formula. It turns out that
this can be achieved by requiring that the $x_i$ are roots of a polynomial
of a certain degree from a system of orthogonal polynomials.


10.2 Construction of Gauss quadrature rules 


Suppose that the function $f$ is defined on the closed interval $[a,b]$ and
that it is continuous and differentiable on this interval. Suppose, further,
that $w$ is a weight function, defined, positive, continuous and integrable
on $(a,b)$. We wish to construct quadrature formulae for the approximate
evaluation of the integral

\[
  \int_a^b w(x) f(x)\,dx .
\]

For a nonnegative integer $n$, let $x_i$, $i = 0,\ldots,n$, be $n+1$ points in
the interval $[a,b]$; the precise location of these points will be determined
later on. The Hermite interpolation polynomial of degree $2n+1$ for the
function $f$ is given by the expression (see Section 6.4)


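\[
  p_{2n+1}(x) = \sum_{k=0}^{n} H_k(x) f(x_k) + \sum_{k=0}^{n} K_k(x) f'(x_k),
  \tag{10.1}
\]

where

\[
  H_k(x) = [L_k(x)]^2 \left( 1 - 2 L_k'(x_k)(x - x_k) \right), \qquad
  K_k(x) = [L_k(x)]^2 (x - x_k).
\]

Further, for $n \ge 1$, $L_k \in \mathcal{P}_n$ is defined by

\[
  L_k(x) = \prod_{i=0,\ i \ne k}^{n} \frac{x - x_i}{x_k - x_i};
  \tag{10.2}
\]

if $n = 0$, we let $L_0(x) \equiv 1$ and thereby $H_0(x) \equiv 1$ and
$K_0(x) = x - x_0$ for this value of $n$. Thus, we deduce from (10.1) that

\[
  \int_a^b w(x) f(x)\,dx \approx \int_a^b w(x) p_{2n+1}(x)\,dx
    = \sum_{k=0}^{n} W_k f(x_k) + \sum_{k=0}^{n} V_k f'(x_k),
  \tag{10.3}
\]

where

\[
  W_k = \int_a^b w(x) H_k(x)\,dx, \qquad
  V_k = \int_a^b w(x) K_k(x)\,dx .
\]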


There is an obvious advantage in choosing the points $x_k$ in such a way
that all the coefficients $V_k$ are zero, for then the derivative values
$f'(x_k)$ are not required. Recalling the form of the polynomial $K_k$ and
inserting it into the defining expression for $V_k$, we have

\[
  \begin{aligned}
  V_k &= \int_a^b w(x) [L_k(x)]^2 (x - x_k)\,dx \\
      &= C_n \int_a^b w(x)\, \pi_{n+1}(x) L_k(x)\,dx,
  \end{aligned}
  \tag{10.4}
\]

where $\pi_{n+1}(x) = (x - x_0) \cdots (x - x_n)$ and

\[
  C_n = \begin{cases}
    \prod_{i=0,\ i \ne k}^{n} (x_k - x_i)^{-1} & \text{if } n \ge 1, \\
    1 & \text{if } n = 0.
  \end{cases}
\]

Since $\pi_{n+1}$ is of degree $n+1$ while $L_k(x)$ is of degree $n$ for each
$k$, $0 \le k \le n$, each $V_k$ will be zero if the polynomial $\pi_{n+1}$ is
orthogonal to every polynomial of lower degree with respect to the weight
function $w$. We can therefore construct the required quadrature formula
(10.3) with $V_k = 0$, $k = 0,\ldots,n$, by choosing the points $x_k$,
$k = 0,\ldots,n$, to be the zeros of the polynomial of degree $n+1$ in a
system of orthogonal polynomials over the interval $(a,b)$ with respect to
the weight function $w$; we know from Theorem 9.4 that these zeros are real
and distinct, and all lie in the open interval $(a,b)$.

Having chosen the location of the points $x_k$, we now consider $W_k$:

\[
  \begin{aligned}
  W_k &= \int_a^b w(x) H_k(x)\,dx \\
      &= \int_a^b w(x) [L_k(x)]^2 \left( 1 - 2 L_k'(x_k)(x - x_k) \right) dx \\
      &= \int_a^b w(x) [L_k(x)]^2\,dx - 2 L_k'(x_k) V_k .
  \end{aligned}
  \tag{10.5}
\]

Since $V_k = 0$, the second term in the last line vanishes and thus we
obtain the following numerical integration formula, known as the Gauss
quadrature¹ rule:

\[
  \int_a^b w(x) f(x)\,dx \approx G_n(f) = \sum_{k=0}^{n} W_k f(x_k),
  \tag{10.6}
\]

where the quadrature weights are

\[
  W_k = \int_a^b w(x) [L_k(x)]^2\,dx,
  \tag{10.7}
\]

and the quadrature points $x_k$, $k = 0,\ldots,n$, are chosen as the zeros of
the polynomial of degree $n+1$ from a system of orthogonal polynomials
over the interval $(a,b)$ with respect to the weight function $w$. Since
this quadrature rule was obtained by exact integration of the Hermite
interpolation polynomial of degree $2n+1$ for $f$, it gives the exact result
whenever $f$ is a polynomial of degree $2n+1$ or less.
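
The weight formula (10.7) can be tried out numerically. The following is a minimal sketch, not part of the original text, assuming $w(x) \equiv 1$ on $(-1,1)$ and $n = 2$, so that the quadrature points are the zeros $0, \pm\sqrt{3/5}$ of the Legendre polynomial of degree 3. The helper names (`poly_mul`, `poly_integral`, `lagrange_basis`, `gauss_weights`, `gauss`) are illustrative only.

```python
import math

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_integral(p, a, b):
    """Exact integral of a polynomial over [a, b] (weight function w = 1)."""
    return sum(c * (b ** (i + 1) - a ** (i + 1)) / (i + 1) for i, c in enumerate(p))

def lagrange_basis(nodes, k):
    """Coefficients of L_k(x) = prod_{i != k} (x - x_i) / (x_k - x_i), as in (10.2)."""
    p = [1.0]
    for i, xi in enumerate(nodes):
        if i != k:
            d = nodes[k] - xi
            p = poly_mul(p, [-xi / d, 1.0 / d])   # multiply by (x - x_i) / d
    return p

def gauss_weights(nodes, a, b):
    """Weights W_k = int_a^b [L_k(x)]^2 dx, i.e. (10.7) with w = 1."""
    basis = [lagrange_basis(nodes, k) for k in range(len(nodes))]
    return [poly_integral(poly_mul(L, L), a, b) for L in basis]

# Assumption: n = 2 on (-1, 1); nodes are the zeros of the degree-3 Legendre polynomial.
nodes = [-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5)]
weights = gauss_weights(nodes, -1.0, 1.0)

def gauss(f):
    """Evaluate the rule G_n(f) of (10.6)."""
    return sum(W * f(x) for W, x in zip(weights, nodes))
```

With these nodes the computed weights should agree with the classical values 5/9, 8/9, 5/9, and the rule should reproduce the integral of any polynomial of degree 5 or less, but not that of $x^6$.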


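1 Carl Friedrich Gauss, Methodus nova integralium valores per approximationem
inveniendi, 1814.


Example 10.1 Consider the case $n = 1$, with the weight function
$w(x) \equiv 1$ over the interval $(0,1)$.

The quadrature points $x_0$, $x_1$ are then the zeros of the polynomial
$\varphi_2$ constructed in Example 9.5 and given by (9.8),

\[
  \varphi_2(x) = x^2 - x + \tfrac{1}{6},
  \tag{10.8}
\]

and therefore

\[
  x_0 = \tfrac{1}{2} - \sqrt{\tfrac{1}{12}}, \qquad
  x_1 = \tfrac{1}{2} + \sqrt{\tfrac{1}{12}} .
\]

Clearly, $x_0$ and $x_1$ belong to the open interval $(0,1)$, in accordance
with Theorem 9.4. The weights are obtained from (10.7):

\[
  \begin{aligned}
  W_0 &= \int_0^1 \left( \frac{x - x_1}{x_0 - x_1} \right)^2 dx \\
      &= 3 \int_0^1 (x^2 - 2 x_1 x + x_1^2)\,dx \\
      &= 3 \left( \tfrac{1}{3} - x_1 + x_1^2 \right) \\
      &= \tfrac{1}{2},
  \end{aligned}
  \tag{10.9}
\]

and $W_1 = \tfrac{1}{2}$ in the same way. We thus have the Gauss quadrature
rule

\[
  \int_0^1 f(x)\,dx \approx
    \tfrac{1}{2} f\left( \tfrac{1}{2} - \sqrt{\tfrac{1}{12}} \right)
    + \tfrac{1}{2} f\left( \tfrac{1}{2} + \sqrt{\tfrac{1}{12}} \right),
  \tag{10.10}
\]

which is exact whenever $f$ is a polynomial of degree $2 \times 1 + 1 = 3$ or
less. ◊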


10.3 Direct construction 


The calculation of the weights and the quadrature points in a Gauss 
quadrature rule requires little work when the system of orthogonal poly- 
nomials is already known. If this is not known, at the very least it is 
necessary to construct the polynomial from the system whose roots are 
the quadrature points; in that case a straightforward approach, which 
avoids this construction, may be easier. 

Suppose, for example, that we wish to find the values of $A_0$, $A_1$, $x_0$
and $x_1$ such that the quadrature rule

\[
  \int_0^1 f(x)\,dx \approx A_0 f(x_0) + A_1 f(x_1)
  \tag{10.11}
\]

is exact for all $f \in \mathcal{P}_3$.

We have to determine four unknowns, $A_0$, $A_1$, $x_0$ and $x_1$, so we need
four equations; thus we take, in turn, $f(x) \equiv 1$, $f(x) = x$,
$f(x) = x^2$ and $f(x) = x^3$ and demand that the quadrature rule (10.11) is
exact (that is, the integral of $f$ is equal to the corresponding
approximation obtained by inserting $f$ into the right-hand side of
(10.11)). Hence,

\[
  \begin{aligned}
  1 &= A_0 + A_1, && (10.12) \\
  \tfrac{1}{2} &= A_0 x_0 + A_1 x_1, && (10.13) \\
  \tfrac{1}{3} &= A_0 x_0^2 + A_1 x_1^2, && (10.14) \\
  \tfrac{1}{4} &= A_0 x_0^3 + A_1 x_1^3. && (10.15)
  \end{aligned}
\]

It remains to solve this system. To do so, we consider the quadratic
polynomial $\pi_2$ defined by

\[
  \pi_2(x) = (x - x_0)(x - x_1),
\]

whose roots are the unknown quadrature points $x_0$ and $x_1$. In expanded
form, $\pi_2(x)$ can be written as

\[
  \pi_2(x) = x^2 + p x + q .
\]

First we shall determine $p$ and $q$; then we shall find the roots $x_0$ and
$x_1$ of $\pi_2$. We shall then insert the values of $x_0$ and $x_1$ into
(10.13) and solve the linear system (10.12), (10.13) for $A_0$ and $A_1$.

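To find $p$ and $q$, we multiply (10.12) by $q$, (10.13) by $p$ and (10.14)
by 1, and we add up the resulting equations to deduce that

\[
  \begin{aligned}
  \tfrac{1}{3} + \tfrac{1}{2} p + q
    &= A_0 (x_0^2 + p x_0 + q) + A_1 (x_1^2 + p x_1 + q) \\
    &= A_0 \pi_2(x_0) + A_1 \pi_2(x_1) = A_0 \cdot 0 + A_1 \cdot 0 = 0 .
  \end{aligned}
\]

Therefore,

\[
  \tfrac{1}{3} + \tfrac{1}{2} p + q = 0 .
  \tag{10.16}
\]

Similarly, we multiply (10.13) by $q$, (10.14) by $p$ and (10.15) by 1, and
we add up the resulting equations to obtain

\[
  \begin{aligned}
  \tfrac{1}{4} + \tfrac{1}{3} p + \tfrac{1}{2} q
    &= A_0 x_0 (x_0^2 + p x_0 + q) + A_1 x_1 (x_1^2 + p x_1 + q) \\
    &= A_0 x_0 \pi_2(x_0) + A_1 x_1 \pi_2(x_1) = A_0 \cdot 0 + A_1 \cdot 0 = 0 .
  \end{aligned}
\]

Thus,

\[
  \tfrac{1}{4} + \tfrac{1}{3} p + \tfrac{1}{2} q = 0 .
  \tag{10.17}
\]

From (10.16) and (10.17) we immediately find that $p = -1$ and
$q = \tfrac{1}{6}$.

Having determined $p$ and $q$, we see that

\[
  \pi_2(x) = x^2 - x + \tfrac{1}{6},
\]

in agreement with (10.8). We then find the roots of this quadratic
polynomial to give $x_0$ and $x_1$ as before. With these values of $x_0$ and
$x_1$ we deduce from (10.12) and (10.13) that

\[
  \begin{aligned}
  A_0 + A_1 &= 1, \\
  A_0 \left( \tfrac{1}{2} - \sqrt{\tfrac{1}{12}} \right)
    + A_1 \left( \tfrac{1}{2} + \sqrt{\tfrac{1}{12}} \right) &= \tfrac{1}{2},
  \end{aligned}
\]

and therefore $A_0 = A_1 = \tfrac{1}{2}$. Thus, we conclude that the required
quadrature rule is (10.10), as before.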

It is easy to see that equations (10.16) and (10.17) express the
condition that the polynomial $x^2 + px + q$ is orthogonal to the polynomials
1 and $x$ respectively. This alternative approach has simply constructed
a quadratic polynomial from a system of orthogonal polynomials by
requiring that it is orthogonal to every polynomial of lower degree, instead
of building up the whole system of orthogonal polynomials.

A straightforward calculation shows that, in general, the quadrature
rule (10.10) is not exact for polynomials of degree higher than 3 (take
$f(x) = x^4$, for example, to verify this).
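
The direct construction can be retraced in a few lines of code. This is only a sketch under the assumptions of this section ($w(x) \equiv 1$ on $(0,1)$, two quadrature points); the variable names are illustrative and the 2×2 system (10.16), (10.17) is solved by Cramer's rule, which is one convenient choice rather than anything prescribed by the text.

```python
from fractions import Fraction
import math

# Moments m_i = int_0^1 x^i dx: the left-hand sides of (10.12)-(10.15).
m = [Fraction(1, i + 1) for i in range(4)]

# (10.16) and (10.17) form a 2x2 linear system for p and q:
#   m1*p + m0*q = -m2,   m2*p + m1*q = -m3.   Solve it by Cramer's rule.
det = m[1] * m[1] - m[0] * m[2]
p = (m[0] * m[3] - m[1] * m[2]) / det
q = (m[2] * m[2] - m[1] * m[3]) / det

# Quadrature points: the roots of pi_2(x) = x^2 + p*x + q.
disc = math.sqrt(p * p - 4 * q)
x0, x1 = (-p - disc) / 2, (-p + disc) / 2

# Weights from (10.12) and (10.13): A0 + A1 = m0, A0*x0 + A1*x1 = m1.
A1 = (m[1] - m[0] * x0) / (x1 - x0)
A0 = m[0] - A1

def rule(f):
    """The two-point rule A0*f(x0) + A1*f(x1) of (10.11)."""
    return A0 * f(x0) + A1 * f(x1)

# The rule integrates x^3 exactly but not x^4.
error_x4 = 1 / 5 - rule(lambda x: x ** 4)
```

Here `p` and `q` come out as exact fractions, while the points and weights are floating-point numbers; `error_x4` is nonzero, which illustrates that the rule is not exact for polynomials of degree 4.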


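Example 10.2 We shall apply the quadrature rule (10.10) to compute
an approximation to the integral $I = \int_0^1 \mathrm{e}^x\,dx$.

Using (10.10) with $f(x) = \exp(x) = \mathrm{e}^x$ yields

\[
  I \approx \tfrac{1}{2} \exp\left( \tfrac{1}{2} - \sqrt{\tfrac{1}{12}} \right)
    + \tfrac{1}{2} \exp\left( \tfrac{1}{2} + \sqrt{\tfrac{1}{12}} \right)
    = \sqrt{\mathrm{e}}\, \cosh \sqrt{\tfrac{1}{12}} .
\]

On rounding to six decimal digits, $I \approx 1.717896$. The exact value of
the integral is $I = \mathrm{e} - 1 = 1.718282$, rounding to six decimal
digits. ◊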
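
The numbers in Example 10.2 are easy to reproduce. The short sketch below simply evaluates the two-point rule (10.10) for the exponential function and compares it with the closed form and with the exact integral; the variable names are illustrative.

```python
import math

s = math.sqrt(1 / 12)

# Rule (10.10) applied to f(x) = e^x.
approx = 0.5 * math.exp(0.5 - s) + 0.5 * math.exp(0.5 + s)

# The same value written as sqrt(e) * cosh(sqrt(1/12)).
closed_form = math.sqrt(math.e) * math.cosh(s)

exact = math.e - 1
print(f"{approx:.6f} {exact:.6f}")   # prints "1.717896 1.718282"
```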


10.4 Error estimation for Gauss quadrature 


The next theorem provides a bound on the error that has been commit- 
ted by approximating the integral on the left-hand side of (10.6) by the 
quadrature rule on the right. 


Theorem 10.1 Suppose that $w$ is a weight function, defined, integrable,
continuous and positive on $(a,b)$, and that $f$ is defined and continuous
on $[a,b]$; suppose further that $f$ has a continuous derivative of order
$2n+2$ on $[a,b]$, $n \ge 0$. Then, there exists a number $\eta$ in $(a,b)$
such that

\[
  \int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} W_k f(x_k) = K_n f^{(2n+2)}(\eta),
  \tag{10.18}
\]

and

\[
  K_n = \frac{1}{(2n+2)!} \int_a^b w(x) [\pi_{n+1}(x)]^2\,dx .
\]

Consequently, the integration formula (10.6), (10.7) will give the exact
result for every polynomial of degree $2n+1$.

Proof Recalling the definition of the Hermite interpolation polynomial
$p_{2n+1}$ for the function $f$ and using Theorem 6.4, we have

\[
  \begin{aligned}
  \int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} W_k f(x_k)
    &= \int_a^b w(x) \left( f(x) - p_{2n+1}(x) \right) dx \\
    &= \int_a^b w(x) \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!} [\pi_{n+1}(x)]^2\,dx .
  \end{aligned}
  \tag{10.19}
\]

However, by the Integral Mean Value Theorem, Theorem A.6, the last
term is equal to

\[
  \frac{f^{(2n+2)}(\eta)}{(2n+2)!} \int_a^b w(x) [\pi_{n+1}(x)]^2\,dx,
\]

for some $\eta \in (a,b)$, and hence the desired error bound.
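
As a sanity check of (10.18), consider the rule (10.10), that is $n = 1$ and $w(x) \equiv 1$ on $(0,1)$. The sketch below, which is an illustration rather than part of the original argument, computes $K_1$ exactly and compares the predicted error with the observed one for $f(x) = x^4$ (where $f^{(4)} \equiv 24$) and for $f(x) = \mathrm{e}^x$ (where $f^{(4)}$ lies between 1 and $\mathrm{e}$). Names such as `C1`, `K1` and `gauss2` are illustrative.

```python
import math
from fractions import Fraction

# pi_2(x) = x^2 - x + 1/6, coefficients listed from the constant term upwards.
pi2 = [Fraction(1, 6), Fraction(-1), Fraction(1)]

# Coefficients of [pi_2(x)]^2.
sq = [Fraction(0)] * 5
for i, a in enumerate(pi2):
    for j, b in enumerate(pi2):
        sq[i + j] += a * b

C1 = sum(c / (i + 1) for i, c in enumerate(sq))   # int_0^1 [pi_2(x)]^2 dx
K1 = C1 / math.factorial(4)                       # K_n of Theorem 10.1 with n = 1

s = math.sqrt(1 / 12)

def gauss2(f):
    """The two-point rule (10.10)."""
    return 0.5 * f(0.5 - s) + 0.5 * f(0.5 + s)

err_x4 = 1 / 5 - gauss2(lambda x: x ** 4)     # should equal K1 * 24
err_exp = (math.e - 1) - gauss2(math.exp)     # should lie in [K1, K1 * e]
```

In this case $K_1 = 1/4320$, so the error for $x^4$ is $24 K_1 = 1/180$, and the error observed in Example 10.2 falls between $K_1$ and $K_1 \mathrm{e}$, as the theorem requires.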


Note that, by virtue of Theorem 10.1, the Gauss quadrature rule gives
the exact value of the integral when $f$ is a polynomial of degree $2n+1$
or less, which is the highest possible degree that one can hope for with
the $2n+2$ free parameters consisting of the quadrature weights $W_k$,
$k = 0,\ldots,n$, and the quadrature points $x_k$, $k = 0,\ldots,n$.

A different approach leads to a proof of convergence of the Gauss
formulae $G_n(f)$, defined in (10.6), (10.7), as $n \to \infty$.


Theorem 10.2 Suppose that the weight function $w$ is defined, positive,
continuous and integrable on the open interval $(a,b)$. Suppose also that
the function $f$ is continuous on the closed interval $[a,b]$. Then,

\[
  \lim_{n \to \infty} G_n(f) = \int_a^b w(x) f(x)\,dx .
\]


Proof If we choose any positive real number $\varepsilon_0$ then, since $f$
is continuous on $[a,b]$, the Weierstrass Theorem (Theorem 8.1) shows that
there is a polynomial $p$ such that

\[
  |f(x) - p(x)| \le \varepsilon_0 \qquad \text{for all } x \in [a,b].
  \tag{10.20}
\]

Let $N$ be the degree of this polynomial, and write $p$ as $p_N$.
Thus we deduce that

\[
  \begin{aligned}
  \int_a^b w(x) f(x)\,dx - G_n(f)
    &= \int_a^b w(x) [f(x) - p_N(x)]\,dx \\
    &\quad + \left[ \int_a^b w(x) p_N(x)\,dx - G_n(p_N) \right]
       + \left[ G_n(p_N) - G_n(f) \right].
  \end{aligned}
  \tag{10.21}
\]

Consider the first term on the right of this equality; it follows from
(10.20) that

\[
  \left| \int_a^b w(x) [f(x) - p_N(x)]\,dx \right| \le \varepsilon_0 W,
\]

where

\[
  W = \int_a^b w(x)\,dx .
\]

For the last term on the right of (10.21),

\[
  \begin{aligned}
  |G_n(f) - G_n(p_N)|
    &\le \sum_{k=0}^{n} \left| W_k [f(x_k) - p_N(x_k)] \right| \\
    &\le \varepsilon_0 \sum_{k=0}^{n} W_k \\
    &= \varepsilon_0 \int_a^b w(x)\,dx \\
    &= \varepsilon_0 W,
  \end{aligned}
  \tag{10.22}
\]

where we have used the fact that all the quadrature weights $W_k$ are
positive (see (10.7)), and that a Gauss quadrature rule integrates a
constant function exactly. Now for the middle term in (10.21), if we define
$N_0$ to be the integer part of $\tfrac{1}{2} N$, we see that when
$n \ge N_0$ the quadrature formula is exact for all polynomials of degree
$2N_0 + 1$ or less, and hence for the polynomial $p_N$ (given that
$N \le 2N_0 + 1 \le 2n + 1$). Therefore,

\[
  \int_a^b w(x) p_N(x)\,dx - G_n(p_N) = 0 \qquad \text{if } n \ge N_0 .
\]

Putting these three terms together, we see that

\[
  \left| \int_a^b w(x) f(x)\,dx - G_n(f) \right|
    \le \varepsilon_0 W + 0 + \varepsilon_0 W \qquad \text{if } n \ge N_0 .
\]

Finally, given any positive number $\varepsilon$, we define
$\varepsilon_0 = \varepsilon/(2W)$ and find the corresponding value of
$N_0 = N_0(\varepsilon)$ to deduce that

\[
  \left| \int_a^b w(x) f(x)\,dx - G_n(f) \right| \le \varepsilon
  \qquad \text{if } n \ge N_0,
\]

which is what we were required to prove.

The interest of this theorem is mainly theoretical, as it gives no
indication of how rapidly the error tends to zero. However, it does show
the importance of the fact that the weights $W_k$ are positive. Much of
the above proof would apply with little change to the Newton–Cotes
formulae of Section 7.2. We saw there that for the formulae of order 1
and 2, the trapezium rule and Simpson’s rule, the weights are positive.
However, when $n \ge 8$ some of the weights in the Newton–Cotes formula
of order $n$ become negative. In this case we have
$\sum_{k=0}^{n} W_k = (b - a)$, but we find that
$\sum_{k=0}^{n} |W_k| \to \infty$ as $n \to \infty$, so the proof breaks down.
Stronger conditions must be imposed on the function $f$ to ensure that
the Newton–Cotes formula converges to the required integral. (See the
example in Section 7.4.)
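
The sign behaviour of the Newton–Cotes weights can be observed directly. The following sketch, an illustration that assumes $w(x) \equiv 1$ on $[0,1]$ and equally spaced points $x_k = k/n$, integrates each Lagrange basis polynomial exactly in rational arithmetic; the function name `newton_cotes_weights` is illustrative.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Weights of the closed Newton-Cotes formula of order n on [0, 1]."""
    nodes = [Fraction(j, n) for j in range(n + 1)]
    weights = []
    for k in range(n + 1):
        poly = [Fraction(1)]                 # coefficients, lowest degree first
        for i in range(n + 1):
            if i != k:
                d = nodes[k] - nodes[i]
                # Multiply poly by the factor (x - x_i) / (x_k - x_i).
                new = [Fraction(0)] * (len(poly) + 1)
                for j, c in enumerate(poly):
                    new[j] += -nodes[i] * c / d
                    new[j + 1] += c / d
                poly = new
        # Integrate the Lagrange basis polynomial L_k over [0, 1].
        weights.append(sum(c / (j + 1) for j, c in enumerate(poly)))
    return weights

for n in (2, 7, 8):
    w = newton_cotes_weights(n)
    print(n, min(w) > 0, float(sum(abs(x) for x in w)))
```

For $n = 1$ and $n = 2$ this recovers the trapezium and Simpson weights. For $n = 8$ the smallest weight is negative, so the sum of the absolute values of the weights exceeds $b - a = 1$ even though the weights themselves still sum to 1.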


10.5 Composite Gauss formulae 


It is often useful to define composite Gauss formulae, just as we did for
the trapezium rule and Simpson’s rule in Section 7.5. Let us suppose,
for the sake of simplicity, that $w(x) \equiv 1$. We divide the range $[a,b]$
into $m$ subintervals $[x_{j-1}, x_j]$, $j = 1,2,\ldots,m$, $m \ge 2$, each of
width $h = (b - a)/m$, and write

\[
  \int_a^b f(x)\,dx = \sum_{j=1}^{m} \int_{x_{j-1}}^{x_j} f(x)\,dx,
\]

where

\[
  x_j = a + jh, \qquad j = 0,1,\ldots,m .
\]

We then map each of the subintervals $[x_{j-1}, x_j]$, $j = 1,2,\ldots,m$,
onto the reference interval $[-1,1]$ by the change of variable

\[
  x = \tfrac{1}{2}(x_{j-1} + x_j) + \tfrac{1}{2} h t, \qquad t \in [-1,1],
\]

giving

\[
  \int_a^b f(x)\,dx = \tfrac{1}{2} h \sum_{j=1}^{m} \int_{-1}^{1} g_j(t)\,dt
    = \tfrac{1}{2} h \sum_{j=1}^{m} I_j,
\]

where

\[
  g_j(t) = f\left( \tfrac{1}{2}(x_{j-1} + x_j) + \tfrac{1}{2} h t \right)
  \qquad \text{and} \qquad I_j = \int_{-1}^{1} g_j(t)\,dt .
\]

The composite Gauss quadrature rule is then obtained by applying
the same Gauss formula to each of the integrals $I_j$. This gives

\[
  \begin{aligned}
  \int_a^b f(x)\,dx
    &\approx \tfrac{1}{2} h \sum_{j=1}^{m} \sum_{k=0}^{n} W_k\, g_j(\xi_k) \\
    &= \tfrac{1}{2} h \sum_{j=1}^{m} \sum_{k=0}^{n}
       W_k\, f\left( \tfrac{1}{2}(x_{j-1} + x_j) + \tfrac{1}{2} h \xi_k \right),
  \end{aligned}
  \tag{10.23}
\]

where $\xi_k$ are the quadrature points in $(-1,1)$ and $W_k$ are the
associated weights for $k = 0,\ldots,n$ with $n \ge 0$.

An expression for the error of this composite formula is obtained, as
in Section 7.5, by adding the expressions (10.18) for the errors in the
integrals $I_j$. The result is

\[
  E_{n,m} = C_n \frac{(b - a)^{2n+3}}{2^{2n+3}\, m^{2n+2}\, (2n+2)!}\, f^{(2n+2)}(\eta),
  \tag{10.24}
\]

where $\eta \in (a,b)$ and

\[
  C_n = \int_{-1}^{1} [\pi_{n+1}(t)]^2\,dt .
\]

Definition 10.1 The composite midpoint rule is the composite Gauss
formula with $w(x) \equiv 1$ and $n = 0$ defined by

\[
  \int_a^b f(x)\,dx \approx h \sum_{j=1}^{m} f\left( a + (j - \tfrac{1}{2}) h \right).
  \tag{10.25}
\]

This follows from the fact that when $n = 0$ there is one quadrature
point $\xi_0 = 0$ in $(-1,1)$, which is at the midpoint of the interval, and
the corresponding quadrature weight $W_0$ is equal to the length of the
interval $(-1,1)$, i.e., $W_0 = 2$. It follows from (10.24) with $n = 0$ and

\[
  C_0 = \int_{-1}^{1} t^2\,dt = \tfrac{2}{3}
\]

that the error in the composite midpoint rule is

\[
  E_{0,m} = \frac{(b - a)^3}{24 m^2}\, f''(\eta),
\]

where $\eta \in (a,b)$, provided that the function $f$ has a continuous
second derivative on $[a,b]$.
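
A composite Gauss rule of the form (10.23) takes only a few lines to implement. The sketch below is illustrative: it assumes $w(x) \equiv 1$ and uses, as defaults, the two-point Gauss rule on $(-1,1)$ with points $\pm 1/\sqrt{3}$ and weights 1 (the rule (10.10) mapped to the reference interval); passing the single point 0 with weight 2 gives the composite midpoint rule (10.25). The function names are hypothetical.

```python
import math

# Two-point Gauss rule on (-1, 1): points xi_k and weights W_k (n = 1).
XI = [-1 / math.sqrt(3), 1 / math.sqrt(3)]
W = [1.0, 1.0]

def composite_gauss(f, a, b, m, xi=XI, w=W):
    """Composite Gauss rule (10.23) with m subintervals of width h = (b - a)/m."""
    h = (b - a) / m
    total = 0.0
    for j in range(1, m + 1):
        mid = a + (j - 0.5) * h          # (x_{j-1} + x_j) / 2
        total += sum(wk * f(mid + 0.5 * h * t) for wk, t in zip(w, xi))
    return 0.5 * h * total

def composite_midpoint(f, a, b, m):
    """Composite midpoint rule (10.25): one point xi_0 = 0 with weight W_0 = 2."""
    return composite_gauss(f, a, b, m, xi=[0.0], w=[2.0])

exact = math.e - 1
err_mid = exact - composite_midpoint(math.exp, 0.0, 1.0, 10)
err_4 = exact - composite_gauss(math.exp, 0.0, 1.0, 4)
err_8 = exact - composite_gauss(math.exp, 0.0, 1.0, 8)
```

For $f(x) = \mathrm{e}^x$ on $[0,1]$ with $m = 10$, the midpoint error should lie between $1/2400$ and $\mathrm{e}/2400$, in line with the error formula $(b-a)^3 f''(\eta)/(24 m^2)$. Doubling $m$ in the two-point rule should reduce the error by a factor close to $2^4 = 16$, as (10.24) with $n = 1$ suggests.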
10.6 Radau and Lobatto quadrature


We have now discussed two types of quadrature formulae, which have the
same form, $\sum_{k=0}^{n} W_k f(x_k)$. In the Newton–Cotes formulae the
(equally spaced) quadrature points $x_k$ are given, and we were able to find
the weights $W_k$ so that the result was exact for polynomials of degree $n$.
By allowing the quadrature points as well as the weights to be freely
chosen, we constructed Gauss quadrature formulae which were exact for
polynomials of degree $2n+1$. There are also many possible formulae
of mixed type, where some, but not all, of the quadrature points are
given, and the rest can be freely chosen. We might expect that each
quadrature point which is fixed will reduce the degree of polynomial for
which such a formula is exact by 1, from the maximum degree of $2n+1$.

It is often useful to be able to fix one of the endpoints of the interval
as one of the quadrature points. As an example, suppose we prescribe
that $x_0 = a$. Let $p_{2n}$ be an arbitrary polynomial of degree $2n$, and
write

\[
  p_{2n}(x) = (x - a)\, q_{2n-1}(x) + r,
\]

where the quotient $q_{2n-1}$ is a polynomial of degree $2n-1$ and the
remainder $r$ is a constant. The integral of $w\,p_{2n}$ is then

\[
  \int_a^b w(x) p_{2n}(x)\,dx
    = \int_a^b w(x)(x - a)\, q_{2n-1}(x)\,dx + r \int_a^b w(x)\,dx .
\]

We can now construct the usual Gauss quadrature formula for the
interval $[a,b]$ with the modified weight function $(x - a) w(x)$, giving $n$
quadrature points and $n$ weights $x_k^*$, $W_k^*$, $k = 1,\ldots,n$. This
formula will be exact for all polynomials $q$ of degree $2n-1$. Provided that
the weight function $w$ satisfies the standard conditions on $(a,b)$, the
modified weight function does also; in particular it is clearly positive on
$(a,b)$.
This gives

\[
  \begin{aligned}
  \int_a^b w(x) p_{2n}(x)\,dx
    &= \sum_{k=1}^{n} W_k^*\, q_{2n-1}(x_k^*) + r \int_a^b w(x)\,dx \\
    &= \sum_{k=1}^{n} \frac{W_k^*}{x_k^* - a} \left[ p_{2n}(x_k^*) - r \right]
       + r \int_a^b w(x)\,dx \\
    &= \sum_{k=1}^{n} \frac{W_k^*}{x_k^* - a}\, p_{2n}(x_k^*)
       + r \left[ \int_a^b w(x)\,dx - \sum_{k=1}^{n} \frac{W_k^*}{x_k^* - a} \right].
  \end{aligned}
  \tag{10.26}
\]

The fact that $r = p_{2n}(a)$ then leads us to consider the quadrature rule

\[
  \int_a^b w(x) f(x)\,dx \approx W_0 f(a) + \sum_{k=1}^{n} W_k f(x_k),
  \tag{10.27}
\]

where

\[
  W_k = \frac{W_k^*}{x_k^* - a}, \quad x_k = x_k^*, \quad k = 1,\ldots,n,
  \qquad
  W_0 = \int_a^b w(x)\,dx - \sum_{k=1}^{n} W_k .
  \tag{10.28}
\]

By construction, this formula is exact for all polynomials of degree $2n$.
It is obvious that $W_k > 0$ for $k = 1,\ldots,n$. We leave it as an exercise
to show that $W_0 > 0$ also (see Exercise 5).

With only trivial changes it is easy to see how to construct a similar
formula where instead of fixing $x_0 = a$ we fix $x_n = b$. These are known as
Radau quadrature formulae. We leave it as an exercise to construct
the formula corresponding to fixing both $x_0 = a$ and $x_n = b$, which is
known as a Lobatto quadrature formula; as might be expected, this
is exact for all polynomials of degree $2n-1$ (see Exercise 7).
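
To make the Radau construction concrete, here is a minimal sketch for the simplest case, assuming $w(x) \equiv 1$ and $n = 1$ with $x_0 = a$ fixed. The one-point Gauss rule for the modified weight $(x - a) w(x)$ has its point at the ratio of the first two moments of that weight, and the weights then follow from (10.28). The function names `moment` and `radau_two_point` are illustrative.

```python
from fractions import Fraction

def moment(i, a, b):
    """int_a^b x^i dx for the weight function w = 1."""
    return (Fraction(b) ** (i + 1) - Fraction(a) ** (i + 1)) / (i + 1)

def radau_two_point(a, b):
    """Radau rule W0*f(a) + W1*f(x1) on [a, b] with w = 1 (the case n = 1)."""
    # Moments of the modified weight (x - a) * w(x).
    mu0 = moment(1, a, b) - a * moment(0, a, b)
    mu1 = moment(2, a, b) - a * moment(1, a, b)
    x1 = mu1 / mu0              # zero of the degree-1 orthogonal polynomial
    w1_star = mu0               # one-point Gauss weight for the modified weight
    W1 = w1_star / (x1 - a)     # first part of (10.28)
    W0 = moment(0, a, b) - W1   # second part of (10.28)
    return W0, W1, x1

W0, W1, x1 = radau_two_point(-1, 1)

def radau(f):
    """Apply the rule (10.27) on [-1, 1] with the fixed point x_0 = -1."""
    return W0 * f(Fraction(-1)) + W1 * f(x1)
```

On $[-1,1]$ this gives $x_1 = 1/3$, $W_0 = 1/2$ and $W_1 = 3/2$; the rule integrates $1$, $x$ and $x^2$ exactly (degree $2n = 2$) but not $x^3$.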

The formal process could evidently be generalised to allow for fixing
one of the quadrature points at an internal point $c$, where $a < c < b$.
However, this leads to the difficulty that the modified weight function

\[
  x \mapsto (x - c)\, w(x)
\]

is not positive over the whole interval $(a,b)$; hence we can no longer be
sure that it is possible to construct a system of orthogonal polynomials,
or, even if we can, that these polynomials will have all their zeros real
and distinct and lying in $[a,b]$. In general, therefore, such quadrature
formulae may not exist.


10.7 Note 


For a detailed guide to the literature on Gauss quadrature rules and 
its connection to the theory of orthogonal polynomials, we refer to the 
books cited in the Notes at the end of Chapter 7. 


Exercises 
10.1 Determine the quadrature points and weights for the weight
function $w\colon x \mapsto -\ln x$ on the interval $(0,1)$, for $n = 0$ and
$n = 1$.

10.2 The weights in the Gauss quadrature formula are given by (10.7),
which is

\[
  W_k = \int_a^b w(x) [L_k(x)]^2\,dx .
\]

Show that $W_k$ can also be calculated from

\[
  W_k = \int_a^b w(x) L_k(x)\,dx .
\]

(This is a simpler way of calculating $W_k$ than (10.7); the importance
of (10.7) is that it shows that the weights are all positive.)

10.3 Suppose that $f$ has a continuous second derivative on $[0,1]$.
Show that there is a point $\xi$ in $(0,1)$ such that

\[
  \int_0^1 x f(x)\,dx = \tfrac{1}{2} f\left( \tfrac{2}{3} \right) + \tfrac{1}{72} f''(\xi) .
\]

10.4 Let $n \ge 0$. Write down the quadrature points $x_j$, $j = 0,\ldots,n$,
for the weight function $w\colon x \mapsto (1 - x^2)^{-1/2}$ on the interval
$(-1,1)$.

By induction, or otherwise, show that for positive integer values
of $n$,

\[
  \sum_{j=0}^{n} \cos (2j+1)\vartheta = \frac{\sin (2n+2)\vartheta}{2 \sin \vartheta},
\]

unless $\vartheta$ is a multiple of $\pi$. What is the value of the sum when
$\vartheta$ is a multiple of $\pi$?
Deduce that

\[
  \frac{\pi}{n+1} \sum_{j=0}^{n} T_k(x_j) = \int_{-1}^{1} (1 - x^2)^{-1/2}\, T_k(x)\,dx,
  \qquad k = 1,\ldots,n,
\]

and show that

\[
  \frac{\pi}{n+1} \sum_{j=0}^{n} T_0(x_j) = \int_{-1}^{1} (1 - x^2)^{-1/2}\, T_0(x)\,dx,
\]

where $T_k$ is the Chebyshev polynomial of degree $k$.

Deduce that the weights of the quadrature formula with weight
function $w\colon x \mapsto (1 - x^2)^{-1/2}$ on the interval $(-1,1)$ are

\[
  W_k = \frac{\pi}{n+1}, \qquad k = 0,\ldots,n .
\]

10.5 In the notation for the construction of the Radau quadrature
formula in Section 10.6, show that $W_0 > 0$.

10.6 The Laguerre polynomials¹ $L_j$, $j = 0,1,2,\ldots$, are the
orthogonal polynomials associated with the weight function
$w\colon x \mapsto \mathrm{e}^{-x}$ on the semi-infinite interval $(0,\infty)$,
with $L_j$ of exact degree $j$. (See Exercise 9.5.) Show that

\[
  \int_0^{\infty} \mathrm{e}^{-x}\, x\, [L_j(x) - L_j'(x)]\, p_r(x)\,dx = 0
\]

when $p_r$ is any polynomial of degree less than $j$.
In the Radau formula

\[
  \int_0^{\infty} \mathrm{e}^{-x} p_{2n}(x)\,dx
    = W_0\, p_{2n}(0) + \sum_{k=1}^{n} W_k\, p_{2n}(x_k),
\]

where one of the quadrature points is fixed at $x = 0$, show that
the other quadrature points $x_k$, $k = 1,\ldots,n$, are the zeros of
the polynomial $L_n - L_n'$. Deduce that

\[
  \int_0^{\infty} \mathrm{e}^{-x} p_2(x)\,dx = \tfrac{1}{2} p_2(0) + \tfrac{1}{2} p_2(2) .
\]

10.7 Let $n \ge 2$. Show that a polynomial $p_{2n-1}$ of degree $2n-1$ can
be written

\[
  p_{2n-1}(x) = (x - a)(b - x)\, q_{2n-3}(x) + r(x - a) + s(b - x),
\]

where $q_{2n-3}$ is a polynomial of degree $2n-3$, and $r$ and $s$ are
constants. Hence construct the Lobatto quadrature formula

\[
  \int_a^b w(x) f(x)\,dx \approx W_0 f(a) + \sum_{k=1}^{n-1} W_k f(x_k) + W_n f(b),
\]

which is exact when $f$ is any polynomial of degree $2n-1$. Show
that all the weights $W_k$, $k = 0,1,\ldots,n$, are positive.

10.8 Construct the Lobatto quadrature formula

\[
  \int_{-1}^{1} f(x)\,dx \approx A_0 f(-1) + A_1 f(x_1) + A_2 f(1)
\]

for the interval $(-1,1)$ with weight function $w(x) \equiv 1$, and with
$n = 2$; write down and solve four equations to determine $x_1$, $A_0$,
$A_1$ and $A_2$.

10.9 Write $T_m$ for the composite trapezium rule (7.15), $S_m$ for the
composite Simpson rule (7.17) and $M_m$ for the composite midpoint
rule (10.25), each with $m$ subintervals. Show that

\[
  M_m = 2 T_{2m} - T_m, \qquad
  S_{2m} = \frac{4 T_{2m} - T_m}{3}, \qquad
  S_{2m} = \frac{2 M_m + T_m}{3} .
\]


1 Edmond Nicolas Laguerre (9 April 1834, Bar-le-Duc, France – 14 Aug 1886,
Bar-le-Duc, France).


11


Piecewise polynomial approximation


11.1 Introduction 


Up to now, the focus of our discussion has been the question of
approximation of a given function $f$, defined on an interval $[a,b]$, by a
polynomial on that interval either through Lagrange interpolation or
Hermite interpolation, or by seeking the polynomial of best approximation
(in the $\infty$-norm or 2-norm). Each of these constructions was global in
nature, in the sense that the approximation was defined by the same
analytical expression on the whole interval $[a,b]$. An alternative and more
flexible way of approximating a function $f$ is to divide the interval
$[a,b]$ into a number of subintervals and to look for a piecewise
approximation by polynomials of low degree. Such piecewise-polynomial
approximations are called splines, and the endpoints of the subintervals
are known as the knots.

More specifically, a spline of degree $n$, $n \ge 1$, is a function which is a
polynomial of degree $n$ or less in each subinterval and has a prescribed
degree of smoothness. We shall expect the spline to be at least
continuous, and usually also to have continuous derivatives of order up to
$k$ for some $k$, $0 \le k \le n$. Clearly, if we require the derivative of
order $n$ to be continuous everywhere the spline is just a single
polynomial, since if two polynomials have the same value and the same
derivatives of every order up to $n$ at a knot, then they must be the same
polynomial. An important class of splines have degree $n$, with continuous
derivatives of order up to and including $n - 1$, but as we shall see later,
lower degrees of smoothness are sometimes considered.

To give a flavour of the theory of splines, we concentrate here on two
simple cases: linear splines and cubic splines.

11.2 Linear interpolating splines


Definition 11.1 Suppose that $f$ is a real-valued function, defined and
continuous on the closed interval $[a,b]$. Further, let
$K = \{x_0, \ldots, x_m\}$ be a subset of $[a,b]$, with
$a = x_0 < x_1 < \cdots < x_m = b$, $m \ge 2$. The linear spline $s_L$,
interpolating $f$ at the points $x_i$, is defined by

\[
  s_L(x) = \frac{x_i - x}{x_i - x_{i-1}}\, f(x_{i-1})
         + \frac{x - x_{i-1}}{x_i - x_{i-1}}\, f(x_i),
  \qquad x \in [x_{i-1}, x_i], \quad i = 1,2,\ldots,m .
  \tag{11.1}
\]

The points $x_i$, $i = 0,1,\ldots,m$, are the knots of the spline, and $K$ is
referred to as the set of knots.


As the function $s_L$ interpolates the function $f$ at the knots, i.e.,
$s_L(x_i) = f(x_i)$, $i = 0,1,\ldots,m$, and over each interval
$[x_{i-1}, x_i]$, for $i = 1,2,\ldots,m$, the function $s_L$ is a linear
polynomial (and therefore continuous), we conclude that $s_L$ is a
continuous piecewise linear function on the interval $[a,b]$.

Given a set of knots $K = \{x_0, \ldots, x_m\}$, we shall use the notation
$h_i = x_i - x_{i-1}$, and let $h = \max_i h_i$. Also, for a positive integer
$k$, we denote by $C^k[a,b]$ the set of all real-valued functions, defined
and continuous on the closed interval $[a,b]$, such that all derivatives, up
to and including order $k$, are defined and continuous on $[a,b]$.

In order to highlight the accuracy of interpolation by linear splines we
state the following error bound in the $\infty$-norm over the interval
$[a,b]$.


Theorem 11.1 Suppose that $f \in C^2[a,b]$ and let $s_L$ be the linear spline
that interpolates $f$ at the knots $a = x_0 < x_1 < \cdots < x_m = b$; then,
the following error bound holds:

\[
  \| f - s_L \|_{\infty} \le \tfrac{1}{8} h^2 \| f'' \|_{\infty},
\]

where $h = \max_i h_i = \max_i (x_i - x_{i-1})$, and $\| \cdot \|_{\infty}$
denotes the $\infty$-norm over $[a,b]$, defined in (8.1).


Proof Consider a subinterval $[x_{i-1}, x_i]$, $1 \le i \le m$. According to
Theorem 6.2, applied on the interval $[x_{i-1}, x_i]$,

\[
  f(x) - s_L(x) = \tfrac{1}{2} f''(\xi)(x - x_{i-1})(x - x_i),
  \qquad x \in [x_{i-1}, x_i],
\]

where $\xi = \xi(x) \in (x_{i-1}, x_i)$. Thus,

\[
  |f(x) - s_L(x)| \le \tfrac{1}{8} h_i^2 \max_{\zeta \in [x_{i-1}, x_i]} |f''(\zeta)| .
\]

Hence,

\[
  |f(x) - s_L(x)| \le \tfrac{1}{8} h^2 \| f'' \|_{\infty}
\]

for each $x \in [x_{i-1}, x_i]$ and each $i = 1,2,\ldots,m$. This gives the
required error bound.
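
The bound of Theorem 11.1 is easy to check numerically. The sketch below is illustrative only: it evaluates the linear spline (11.1) for the assumed test function $f(x) = \mathrm{e}^{-3x}$ on $[0,1]$ with knots $0, \tfrac13, \tfrac23, 1$, samples the error on a fine grid, and compares it with $\tfrac18 h^2 \|f''\|_\infty$, where $\|f''\|_\infty = 9$ for this $f$. The function name `linear_spline` is hypothetical.

```python
import bisect
import math

def linear_spline(knots, values):
    """Return s_L defined by (11.1) for increasing knots and values f(x_i)."""
    def s(x):
        # Index i of a subinterval [x_{i-1}, x_i] containing x.
        i = min(max(bisect.bisect_right(knots, x), 1), len(knots) - 1)
        x0, x1 = knots[i - 1], knots[i]
        return ((x1 - x) * values[i - 1] + (x - x0) * values[i]) / (x1 - x0)
    return s

f = lambda x: math.exp(-3 * x)
knots = [0.0, 1 / 3, 2 / 3, 1.0]
s_L = linear_spline(knots, [f(x) for x in knots])

h = max(b - a for a, b in zip(knots, knots[1:]))
bound = h ** 2 / 8 * 9.0     # ||f''||_inf = 9 on [0, 1], attained at x = 0
max_err = max(abs(f(i / 1000) - s_L(i / 1000)) for i in range(1001))
```

For this example the sampled maximum error is roughly 0.078, comfortably below the bound of 0.125; the largest error occurs in the first subinterval, where $|f''|$ is largest.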


Figure 11.1 shows a typical example: a linear spline approximation
to the function $f\colon x \mapsto \mathrm{e}^{-3x}$ over the interval $[0,1]$,
using two internal knots, $x_1 = \tfrac{1}{3}$, $x_2 = \tfrac{2}{3}$, together
with the endpoints of the interval, $x_0 = 0$ and $x_3 = 1$.


Fig. 11.1. The function $f\colon x \mapsto \mathrm{e}^{-3x}$ (full curve) and its
linear spline approximation (dotted curve). The interval is $[0,1]$, and the
knots are at 0, $\tfrac{1}{3}$, $\tfrac{2}{3}$ and 1.


We conclude this section with a result that provides a characterisation 
of linear splines from the viewpoint of the calculus of variations. 

A subset $A$ of the real line is said to have measure zero if it can
be contained in a countable union of open intervals of arbitrarily small
total length; in other words, for every $\varepsilon > 0$ there exists a
sequence of open intervals $(a_i, b_i)$, $i = 1,2,3,\ldots$, such that

\[
  A \subset \bigcup_{i=1}^{\infty} (a_i, b_i)
  \qquad \text{and} \qquad
  \sum_{i=1}^{\infty} (b_i - a_i) < \varepsilon .
\]

In particular, any finite or countable set $A \subset \mathbb{R}$ has measure
zero. For example, the set of all rational numbers is countable, and
therefore it has measure zero. Trivially, the empty set has measure zero.

Suppose that $B$ is a subset of $\mathbb{R}$. We shall say that a certain
property $P = P(x)$ holds for almost every $x$ in $B$, if there exists a set
$A \subset B$ of measure zero such that $P(x)$ holds for all
$x \in B \setminus A$.

A real-valued function $v$ defined on the interval $[a,b]$ is said to be
absolutely continuous on $[a,b]$ if it has finite derivative $v'(\xi)$ at
almost every point $\xi$ in $[a,b]$, $v'$ is (Lebesgue-) integrable on
$[a,b]$, and

\[
  \int_a^x v'(\xi)\,d\xi = v(x) - v(a), \qquad a \le x \le b .
\]


Example 11.1 Any $v \in C^1[a,b]$ is absolutely continuous on the interval
$[a,b]$. The function $x \mapsto |x - \tfrac{1}{2}(a+b)|$ is absolutely
continuous on $[a,b]$, but it does not belong to $C^1[a,b]$ as it is not
differentiable at $x = \tfrac{1}{2}(a+b)$.


Let us denote by $H^1(a,b)$ the set of all absolutely continuous functions
$v$ defined on $[a,b]$ such that $v' \in L^2(a,b)$, i.e.,

\[
  \| v' \|_2 = \left( \int_a^b |v'(x)|^2\,dx \right)^{1/2} < \infty .
\]


We observe in passing that any function $v \in H^1(a,b)$ is uniformly
continuous on the closed interval $[a,b]$. This follows by noting that, for
any pair of points $x, y \in [a,b]$,

\[
  \begin{aligned}
  |v(x) - v(y)| &= \left| \int_y^x v'(\xi)\,d\xi \right| \\
    &\le |x - y|^{1/2} \left| \int_y^x |v'(\xi)|^2\,d\xi \right|^{1/2} \\
    &\le |x - y|^{1/2}\, \| v' \|_2 .
  \end{aligned}
\]

In the transition from the first line to the second we used the
Cauchy–Schwarz inequality.

If $k \ge 1$, we shall denote by $H^{k+1}(a,b)$ the set of all
$v \in H^k(a,b)$ such that $v^{(k)}$ is absolutely continuous on $[a,b]$ and
$v^{(k+1)} \in L^2(a,b)$. The set $H^k(a,b)$ is called a Sobolev space of
index $k$. We observe that

\[
  C^k[a,b] \subset H^k(a,b)
\]

for any $k \ge 1$, with strict inclusion. For example, any linear spline on
$[a,b]$ belongs to $H^1(a,b)$, but not to $C^1[a,b]$ unless it is a linear
function over the whole of the interval $[a,b]$.


Example 11.2 Let $\alpha > 1/2$; the function $x \mapsto x^{\alpha}$ then
belongs to $H^1(0,1)$, although it only belongs to $C^1[0,1]$ if
$\alpha \ge 1$.

As a second example, consider the function $x \mapsto x \ln x$ which belongs
to $H^1(0,1)$, but not to $C^1[0,1]$.


The variational characterisation of linear splines stated in the next
theorem expresses the fact that, among all functions $v \in H^1(a,b)$ which
interpolate a given continuous function $f$ at a fixed set of knots in
$[a,b]$, the linear spline $s_L$ that interpolates $f$ at these knots is the
‘flattest’, in the sense that its ‘average slope’ $\| s_L' \|_2$ is smallest.


Theorem 11.2 Suppose that $s_L$ is the linear spline that interpolates
$f \in C[a,b]$ at the knots $a = x_0 < x_1 < \dots < x_m = b$. Then, for any
function $v$ in $H^1(a,b)$ that also interpolates $f$ at these knots,

\[ \|s_L'\|_2 \le \|v'\|_2. \]

Proof Let us observe that

\[ \|v'\|_2^2 = \int_a^b |v'(x) - s_L'(x)|^2\,\mathrm{d}x + \int_a^b |s_L'(x)|^2\,\mathrm{d}x
   + 2\int_a^b (v'(x) - s_L'(x))\,s_L'(x)\,\mathrm{d}x. \qquad (11.2) \]

We shall now use integration by parts to show that the last integral is
equal to 0; the desired inequality will then follow by noting that the first
term on the right-hand side is nonnegative and it is equal to 0 if, and
only if, $v = s_L$. Clearly,

\[ \int_a^b (v'(x) - s_L'(x))\,s_L'(x)\,\mathrm{d}x
   = \sum_{k=1}^m \int_{x_{k-1}}^{x_k} (v'(x) - s_L'(x))\,s_L'(x)\,\mathrm{d}x \]
\[ = \sum_{k=1}^m \Big[ (v(x_k) - s_L(x_k))\,s_L'(x_k-) - (v(x_{k-1}) - s_L(x_{k-1}))\,s_L'(x_{k-1}+)
   - \int_{x_{k-1}}^{x_k} (v(x) - s_L(x))\,s_L''(x)\,\mathrm{d}x \Big]. \qquad (11.3) \]


Now $v(x_i) - s_L(x_i) = f(x_i) - f(x_i) = 0$ for $i = 0, 1, \dots, m$ and, since
$s_L$ is a linear polynomial over each of the open intervals $(x_{k-1}, x_k)$,
$k = 1, 2, \dots, m$, it follows that $s_L''$ is identically 0 on each of these intervals.
Thus, the expression in the square bracket in (11.3) is equal to 0 for each
$k = 1, 2, \dots, m$.

Fig. 11.2. The linear basis spline (or hat function) $\varphi_k$, $1 \le k \le m-1$.


Sobolev spaces play an important role in approximation theory. We 
shall encounter them again in Chapter 14 which is devoted to the ap- 
proximation of solutions to differential equations by piecewise polyno- 
mial functions. 


11.3 Basis functions for the linear spline 
Suppose that $s_L$ is a linear spline with knots $x_i$, $i = 0, 1, \dots, m$, interpolating
the function $f \in C[a,b]$. Instead of specifying the value of $s_L$ on
each subinterval $[x_{i-1}, x_i]$, $i = 1, 2, \dots, m$, we can express $s_L$ as a linear
combination of suitable 'basis functions' $\varphi_k$ as follows:

\[ s_L(x) = \sum_{k=0}^m \varphi_k(x)\, f(x_k), \qquad x \in [a,b]. \]

Here, we require that each $\varphi_k$ is itself a linear spline which vanishes at
every knot except $x_k$, and $\varphi_k(x_k) = 1$. The function $\varphi_k$ is often known
as the linear basis spline or hat function, and is depicted in Figure
11.2.
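The formal definition of $\varphi_k$ is as follows:

\[ \varphi_k(x) = \begin{cases}
   0 & \text{if } x \le x_{k-1}, \\
   (x - x_{k-1})/h_k & \text{if } x_{k-1} \le x \le x_k, \\
   (x_{k+1} - x)/h_{k+1} & \text{if } x_k \le x \le x_{k+1}, \\
   0 & \text{if } x_{k+1} \le x,
   \end{cases} \]

for $k = 1, \dots, m-1$, and with

\[ \varphi_0(x) = \begin{cases}
   (x_1 - x)/h_1 & \text{if } a = x_0 \le x \le x_1, \\
   0 & \text{if } x_1 \le x,
   \end{cases} \]

and

\[ \varphi_m(x) = \begin{cases}
   0 & \text{if } x \le x_{m-1}, \\
   (x - x_{m-1})/h_m & \text{if } x_{m-1} \le x \le x_m = b.
   \end{cases} \]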
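
The hat-function representation can be tried out numerically. The following is a minimal sketch in Python with NumPy, not part of the original text; the function names `hat` and `linear_spline` and the test function $1/(1+x^2)$ on four equally spaced knots are illustrative assumptions. It relies on the observation that $\varphi_k$ is the piecewise-linear interpolant of the data that are 1 at $x_k$ and 0 at every other knot.

```python
import numpy as np

def hat(k, knots, x):
    """Evaluate the linear basis spline phi_k at the points x (x inside [a, b])."""
    knots = np.asarray(knots, dtype=float)
    x = np.asarray(x, dtype=float)
    # phi_k equals 1 at knot k and 0 at all other knots, and is linear in between,
    # so it is the piecewise-linear interpolant of a unit vector.
    unit = np.zeros(len(knots))
    unit[k] = 1.0
    return np.interp(x, knots, unit)

def linear_spline(f, knots, x):
    """Evaluate s_L(x) = sum_k phi_k(x) f(x_k)."""
    return sum(hat(k, knots, x) * f(xk) for k, xk in enumerate(knots))

# Illustrative data (an assumption): f(x) = 1/(1 + x^2) on [0, 5], four knots.
f = lambda t: 1.0 / (1.0 + t**2)
knots = np.linspace(0.0, 5.0, 4)
x = np.linspace(0.0, 5.0, 201)
s = linear_spline(f, knots, x)
print("max |f - s_L| on the grid:", np.max(np.abs(f(x) - s)))
```

Summing the hat functions over all $k$ gives the constant function 1 on $[a,b]$, which is a convenient sanity check on any implementation.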


11.4 Cubic splines 
Suppose that $f \in C[a,b]$ and let $K = \{x_0, \dots, x_m\}$ be a set of $m+1$
knots in the interval $[a,b]$, $a = x_0 < x_1 < \dots < x_m = b$. Consider the
set $S$ of all functions $s \in C^2[a,b]$ such that


• $s(x_i) = f(x_i)$, $i = 0, 1, \dots, m$,
• $s$ is a cubic polynomial on $[x_{i-1}, x_i]$, $i = 1, 2, \dots, m$.


Any element of $S$ is referred to as an interpolating cubic spline.
We note that, unlike linear splines which are uniquely determined by
the interpolating conditions, there is more than one interpolating cubic
spline $s \in C^2[a,b]$ that satisfies the two conditions stated above; indeed,
there are $4m$ coefficients of cubic polynomials (four on each subinterval
$[x_{i-1}, x_i]$, $i = 1, 2, \dots, m$), and only $m+1$ interpolating conditions and
$3(m-1)$ continuity conditions; since $s$ belongs to $C^2[a,b]$, this means
that $s$, $s'$ and $s''$ are continuous at the internal knots $x_1, \dots, x_{m-1}$.
Hence, we have a total of $4m - 2$ conditions for the $4m$ unknown coefficients.
Depending on the choice of the remaining two conditions we can
construct various interpolating cubic splines.

An important class of cubic splines is singled out by the following 
definition. 


Definition 11.2 The natural cubic spline, denoted by $s_2$, is the element
of the set $S$ satisfying the end conditions

\[ s_2''(x_0) = s_2''(x_m) = 0. \]

We shall prove that this definition is correct in the sense that the two
additional conditions in Definition 11.2 uniquely determine $s_2$: this will
be done by describing an algorithm for constructing $s_2$.

Construction of the natural cubic spline. Let us begin by defining
$\sigma_i = s_2''(x_i)$, $i = 0, 1, \dots, m$, and noting that $s_2''$ is a linear function
on each subinterval $[x_{i-1}, x_i]$. Therefore, $s_2''$ can be expressed as

\[ s_2''(x) = \frac{x_i - x}{h_i}\,\sigma_{i-1} + \frac{x - x_{i-1}}{h_i}\,\sigma_i, \qquad x \in [x_{i-1}, x_i]. \]

Integrating this twice we obtain

\[ s_2(x) = \frac{(x_i - x)^3}{6h_i}\,\sigma_{i-1} + \frac{(x - x_{i-1})^3}{6h_i}\,\sigma_i
   + \alpha_i (x - x_{i-1}) + \beta_i (x_i - x), \qquad x \in [x_{i-1}, x_i], \qquad (11.5) \]
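where $\alpha_i$ and $\beta_i$ are constants of integration. Equating $s_2$ with $f$ at the
knots $x_{i-1}$, $x_i$ yields

\[ f(x_{i-1}) = \tfrac{1}{6}\sigma_{i-1} h_i^2 + h_i \beta_i, \qquad
   f(x_i) = \tfrac{1}{6}\sigma_i h_i^2 + h_i \alpha_i. \qquad (11.6) \]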


Expressing $\alpha_i$ and $\beta_i$ from these, inserting them into (11.5) and exploiting
the continuity of $s_2'$ at the internal knots (i.e., using that
$s_2'(x_i-) = s_2'(x_i+)$, $i = 1, \dots, m-1$), gives

\[ h_i \sigma_{i-1} + 2(h_{i+1} + h_i)\,\sigma_i + h_{i+1}\sigma_{i+1}
   = 6\left( \frac{f(x_{i+1}) - f(x_i)}{h_{i+1}} - \frac{f(x_i) - f(x_{i-1})}{h_i} \right) \qquad (11.7) \]

for $i = 1, \dots, m-1$, together with

\[ \sigma_0 = \sigma_m = 0, \]

which is a system of linear equations for the $\sigma_i$. The matrix of the
system is tridiagonal and nonsingular, since the conditions of Theorem
3.4 are clearly satisfied. By solving this linear system we obtain the $\sigma_i$,
$i = 0, 1, \dots, m$, and thereby all the $\alpha_i$, $\beta_i$, $i = 1, 2, \dots, m$, from (11.6).
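
The construction just described translates almost line by line into code. The sketch below (Python with NumPy; the function name `natural_cubic_spline` and the sample data are illustrative assumptions, not part of the original text) assembles the system (11.7) with $\sigma_0 = \sigma_m = 0$, recovers $\alpha_i$ and $\beta_i$ from (11.6), and evaluates (11.5). For brevity it uses a dense solver; a tridiagonal solver would be the more economical choice for large $m$.

```python
import numpy as np

def natural_cubic_spline(knots, values):
    """Return (s, sigma): an evaluator for the natural cubic spline and the sigma_i."""
    x = np.asarray(knots, dtype=float)
    f = np.asarray(values, dtype=float)
    m = len(x) - 1
    h = np.diff(x)                      # h[i-1] holds h_i = x_i - x_{i-1}

    # Linear system (11.7) for sigma_1..sigma_{m-1}, plus sigma_0 = sigma_m = 0.
    A = np.zeros((m + 1, m + 1))
    rhs = np.zeros(m + 1)
    A[0, 0] = A[m, m] = 1.0
    for i in range(1, m):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        rhs[i] = 6.0 * ((f[i + 1] - f[i]) / h[i] - (f[i] - f[i - 1]) / h[i - 1])
    sigma = np.linalg.solve(A, rhs)     # dense solve, used here only for brevity

    # Constants of integration from (11.6); index i-1 stores alpha_i, beta_i.
    beta = (f[:-1] - sigma[:-1] * h**2 / 6.0) / h
    alpha = (f[1:] - sigma[1:] * h**2 / 6.0) / h

    def s(t):
        t = np.asarray(t, dtype=float)
        # i is the subinterval index with x_{i-1} <= t <= x_i, i = 1..m.
        i = np.clip(np.searchsorted(x, t, side="right"), 1, m)
        hi = h[i - 1]
        return (((x[i] - t)**3 * sigma[i - 1] + (t - x[i - 1])**3 * sigma[i]) / (6.0 * hi)
                + alpha[i - 1] * (t - x[i - 1]) + beta[i - 1] * (x[i] - t))

    return s, sigma

# Illustrative data (an assumption): unequally spaced knots, f(x) = 1/(1 + x^2).
knots = np.array([0.0, 1.0, 2.5, 4.0, 5.0])
values = 1.0 / (1.0 + knots**2)
s, sigma = natural_cubic_spline(knots, values)
print("sigma =", sigma)
```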

We have seen in a previous section, in Theorem 11.2, that a linear
spline can be characterised as a minimiser of the functional $v \mapsto \|v'\|_2$
over all $v \in H^1(a,b)$ which interpolate a given continuous function at the
knots of the spline. Natural cubic splines have an analogous property:
among all functions $v \in H^2(a,b)$ which interpolate a given continuous
function $f$ at a fixed set of knots in $[a,b]$, the natural cubic spline $s_2$
is smoothest, in the sense that it minimises $v \mapsto \|v''\|_2$, the 'average
curvature' of $v$.
Theorem 11.3 Let $s_2$ be the natural cubic spline that interpolates a
function $f \in C[a,b]$ at the knots $a = x_0 < x_1 < \dots < x_m = b$. Then, for
any function $v$ in $H^2(a,b)$ that also interpolates $f$ at the knots,

\[ \|s_2''\|_2 \le \|v''\|_2. \]


The proof is analogous to that of Theorem 11.2 and is left as an 
exercise. 

The smoothest interpolation property expressed by Theorem 11.3 is
the source of the name spline.¹ A spline is a flexible thin curve-drawing
aid, made of wood, metal or acrylic. Assuming that its shape is given by
the equation $y = v(x)$, $x \in [a,b]$, and is constrained by requiring that it
passes through a finite set of prescribed points in the plane, $v$ will take
on a shape which minimises the strain energy

\[ E(v) = \int_a^b \frac{|v''(x)|^2}{(1 + |v'(x)|^2)^{5/2}}\,\mathrm{d}x \]

over all functions $v$ which are constrained in the same way. If the
function $v$ is slowly varying, i.e., $\max_{x \in [a,b]} |v'(x)| \ll 1$, this
energy-minimisation property is very similar to the result in Theorem 11.3.


11.5 Hermite cubic splines 


In the previous section we took $f \in C[a,b]$ and demanded that $s$ belonged
to $C^2[a,b]$; here we shall strengthen our requirements on the
smoothness of the function that we wish to interpolate and assume that
$f \in C^1[a,b]$; simultaneously, we shall relax the smoothness requirements
on the associated spline approximation $s$ by demanding that $s \in C^1[a,b]$
only.

Let $K = \{x_0, \dots, x_m\}$ be a set of knots in the interval $[a,b]$ with
$a = x_0 < x_1 < \dots < x_m = b$ and $m \ge 2$. We define the Hermite cubic
spline as a function $s \in C^1[a,b]$ such that


• $s(x_i) = f(x_i)$, $s'(x_i) = f'(x_i)$ for $i = 0, 1, \dots, m$,
• $s$ is a cubic polynomial on $[x_{i-1}, x_i]$ for $i = 1, 2, \dots, m$.


Writing the spline $s$ on the interval $[x_{i-1}, x_i]$ as

\[ s(x) = c_0 + c_1 (x - x_{i-1}) + c_2 (x - x_{i-1})^2 + c_3 (x - x_{i-1})^3,
   \qquad x \in [x_{i-1}, x_i], \qquad (11.8) \]


1 See Carl de Boor: A Practical Guide to Splines, Revised Edition, Springer Applied 
Mathematical Sciences, 27, Springer, New York, 2001. 
we find that $c_0 = f(x_{i-1})$, $c_1 = f'(x_{i-1})$, and

\[ c_2 = 3\,\frac{f(x_i) - f(x_{i-1})}{h_i^2} - \frac{f'(x_i) + 2f'(x_{i-1})}{h_i}, \]
\[ c_3 = \frac{f'(x_i) + f'(x_{i-1})}{h_i^2} - 2\,\frac{f(x_i) - f(x_{i-1})}{h_i^3}. \qquad (11.9) \]


Note that the Hermite cubic spline only has a continuous first derivative 
at the knots, and therefore it is not an interpolating cubic spline in the 
sense of Section 11.4. 

Unlike natural cubic splines, the coefficients of a Hermite cubic spline 
on each subinterval can be written down explicitly without the need to 
solve a tridiagonal system. 
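
Because the coefficients are explicit, a Hermite cubic spline can be evaluated in a few lines. The following sketch (Python with NumPy; the name `hermite_cubic_spline` and the test function are illustrative assumptions) uses $c_0, \dots, c_3$ from (11.8) and (11.9) on each subinterval and, as a plausibility check, compares the observed error with the bound $h^4 \|f^{\mathrm{iv}}\|_\infty / 384$ of Theorem 11.4, using the fact that $|f^{\mathrm{iv}}(x)| \le 24$ for $f(x) = 1/(1+x^2)$.

```python
import numpy as np

def hermite_cubic_spline(knots, f, df):
    """Return an evaluator for the Hermite cubic spline of f (derivative df) on the knots."""
    x = np.asarray(knots, dtype=float)
    fx, dfx = f(x), df(x)
    h = np.diff(x)
    # Coefficients of (11.8) on each subinterval, from (11.9).
    c0 = fx[:-1]
    c1 = dfx[:-1]
    c2 = 3.0 * (fx[1:] - fx[:-1]) / h**2 - (dfx[1:] + 2.0 * dfx[:-1]) / h
    c3 = (dfx[1:] + dfx[:-1]) / h**2 - 2.0 * (fx[1:] - fx[:-1]) / h**3

    def s(t):
        t = np.asarray(t, dtype=float)
        # j is the index of the left knot of the subinterval containing t.
        j = np.clip(np.searchsorted(x, t, side="right") - 1, 0, len(h) - 1)
        d = t - x[j]
        return c0[j] + d * (c1[j] + d * (c2[j] + d * c3[j]))

    return s

# Illustrative data (an assumption): f(x) = 1/(1 + x^2), four equally spaced knots on [0, 5].
f = lambda t: 1.0 / (1.0 + t**2)
df = lambda t: -2.0 * t / (1.0 + t**2)**2
knots = np.linspace(0.0, 5.0, 4)
s = hermite_cubic_spline(knots, f, df)

x = np.linspace(0.0, 5.0, 501)
err = np.max(np.abs(f(x) - s(x)))
bound = (5.0 / 3.0)**4 * 24.0 / 384.0   # h^4 * max|f''''| / 384 with h = 5/3
print("max error:", err, " bound from Theorem 11.4:", bound)
```

Only data at the two end knots of a subinterval enter the coefficients on that subinterval, which is the local character of the approximation discussed later in this section.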

Concerning the size of the interpolation error, we have the following 
result. 


Theorem 11.4 Let $f \in C^4[a,b]$, and let $s$ be the Hermite cubic spline
that interpolates $f$ at the knots $a = x_0 < x_1 < \dots < x_m = b$; then, the
following error bound holds:

\[ \|f - s\|_\infty \le \frac{1}{384}\, h^4\, \|f^{\mathrm{iv}}\|_\infty, \]

where $f^{\mathrm{iv}} = f^{(4)}$ is the fourth derivative of $f$ with respect to its argument,
$x$, $h = \max_i h_i = \max_i (x_i - x_{i-1})$, and $\|\cdot\|_\infty$ denotes the $\infty$-norm on
the interval $[a,b]$.


The proof is analogous to that of Theorem 11.1, except that Theorem 
6.4 is used instead of Theorem 6.2. 

Both the linear spline and the Hermite cubic spline are local approximations;
the value of the spline at a point $x$ between two knots $x_{i-1}$ and
$x_i$ depends only on the values of the function and its derivative at these
two knots. On the other hand, the natural cubic interpolating spline is a
global approximation and, in this respect, it is more typical of a generic
spline: a change in just one of the values at a knot, $f(x_k)$, will alter the
right-hand side of the system of equations (11.7), so the values of all the
quantities $\sigma_i$ will change. Thus, the spline will change throughout the
whole interval $[x_0, x_m]$. We conclude this section with an example.


Example 11.3 Figure 11.3 shows the Hermite cubic spline approximation
to the function $f\colon x \mapsto 1/(1 + x^2)$, using four equally spaced knots
in the interval $[0,5]$.
The accuracy of this approximation is in striking contrast to the Lagrange
polynomial approximation of degree 10 in Figure 6.1. The approximation
over $[-5, 5]$, using seven equally spaced knots, is obviously
obtained by symmetry; here we show only half the range for clarity.


Fig. 11.3. The function $f\colon x \mapsto 1/(1 + x^2)$ (full curve) and its Hermite cubic
spline approximation (dotted curve). The interval is $[0,5]$, and the knots are
at $0$, $\tfrac{5}{3}$, $\tfrac{10}{3}$ and $5$.

As the error of this approximation is quite small, we show in Figure
11.4 graphs of the errors of three spline approximations, each using the
same four knots. Note that in the first interval, $[0, \tfrac{5}{3}]$, the maximum error
of the Hermite cubic spline is larger than that of the linear spline, but
on the other two intervals it is much less. Both of these two splines are
local approximations, as their values on any interval between two knots
depend only on information about the function at those two knots. The
natural cubic spline is a global approximation, as its value at any point
depends on the values of the function at all the knots; on the first interval
its error is much the same size as that of the Hermite cubic spline, but
on the other two intervals its error is affected by this global coupling,
and is a good deal bigger than that of the Hermite cubic spline. ◇


11.6 Basis functions for cubic splines 


We have seen that the family of hat functions forms a basis for the linear
space of linear splines corresponding to a certain fixed set of knots; we
shall now show how to construct a set of basis functions for cubic splines.
The basis functions for splines are usually known as B-splines. Thus,
the basis-splines constructed in Section 11.3 are referred to as linear
B-splines. Here we shall be concerned with the construction of cubic
B-splines. To simplify the notation we shall assume in this section that
the knots are equally spaced, so that

\[ x_k = kh, \qquad k = 0, 1, \dots, n+1, \]

with $h > 0$.

Fig. 11.4. Errors of three spline approximations to $f(x) = 1/(1 + x^2)$: Hermite
cubic (full curve), natural cubic (dotted curve) and linear spline (broken
curve). The interval is $[0,5]$, and the knots are at $0$, $\tfrac{5}{3}$, $\tfrac{10}{3}$ and $5$.

We begin by introducing the idea of the positive part of a function.


Definition 11.3 Suppose that $n \ge 1$. The positive part of the function
$x \mapsto (x - a)^n$ is the function $x \mapsto (x - a)^n_+$ defined by

\[ (x - a)^n_+ = \begin{cases} (x - a)^n, & x \ge a, \\ 0, & x < a. \end{cases} \]


Clearly the function $x \mapsto (x - x_k)^n_+$ is a spline of degree $n$; at the knot
$x_k$ the derivatives of order up to $n - 1$ are zero, but the derivative of
order $n$ is not continuous at $x = x_k$.

Figure 11.5 shows the graphs of the functions $x \mapsto x_+$ and $x \mapsto x^3_+$
on the interval $[-1, 1]$.

We shall also need the following result.

Fig. 11.5. The graph of the function $x \mapsto (x)^n_+$, for $x$ in the interval $[-1, 1]$,
with $n = 1$ (left) and $n = 3$ (right).


Lemma 11.1 Suppose that $P$ is a polynomial in $x$ of degree $n \ge 1$.
Then, for each $r = 1, \dots, n$, the function $Q_{(r)}$ defined by

\[ Q_{(r)}(x) = \sum_{k=0}^{r} (-1)^k \binom{r}{k} P(x - kh) \]

is a polynomial of degree $n - r$, and $Q_{(n+1)}(x) \equiv 0$, $x \in \mathbb{R}$.
Proof It is easy to see that $Q_{(1)}(x) = P(x) - P(x - h)$, and therefore
$Q_{(1)}$ is a polynomial of degree $n - 1$. Suppose now that, for some $r > 0$,
$Q_{(r)}$ is a polynomial in $x$ of degree $n - r$; then, $x \mapsto Q_{(r)}(x) - Q_{(r)}(x - h)$
is a polynomial of degree $n - r - 1$. But

\[ Q_{(r)}(x) - Q_{(r)}(x - h)
   = \sum_{k=0}^{r} (-1)^k \binom{r}{k} \left[ P(x - kh) - P(x - (k+1)h) \right] \]
\[ = \sum_{k=0}^{r+1} (-1)^k \left[ \binom{r}{k} + \binom{r}{k-1} \right] P(x - kh) \]
\[ = Q_{(r+1)}(x), \qquad (11.10) \]

from the standard properties of binomial coefficients (with the convention
that $\binom{r}{-1} = \binom{r}{r+1} = 0$). Hence $Q_{(r+1)}$ is a
polynomial in $x$ of degree $n - r - 1$, and the result follows by induction.
Finally, this shows that $Q_{(n)}$ is a polynomial of degree 0, and is therefore
constant on $\mathbb{R}$. Thus, by the same argument, $Q_{(n+1)}$ is identically
0 on $\mathbb{R}$.


Theorem 11.5 For each $n \ge 1$, the function $S_{(n)}$ defined by

\[ S_{(n)}(x) = \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} (x - kh)^n_+ \]

is a spline of degree $n$ with equally spaced knots $kh$, $k = 0, 1, \dots, n+1$.
It has a continuous derivative of order $n - 1$ and is identically 0 outside
the interval $(0, (n+1)h)$.


Proof The function $S_{(n)}$ is clearly a spline as stated, and $S_{(n)}(x)$ is
identically 0 for $x \le 0$. When $x \ge (n+1)h$ the arguments $x - kh$,
$k = 0, 1, \dots, n+1$, of the positive parts are all nonnegative, so that

\[ S_{(n)}(x) = \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} (x - kh)^n, \]

and this is identically zero by Lemma 11.1.


Taking $n = 1$ we find that

\[ S_{(1)}(x) = x_+ - 2(x - h)_+ + (x - 2h)_+. \]

After normalisation by $1/h$ so as to have a maximum value of 1, and
shifting $x = 0$ to $x = x_{k-1}$, this yields a representation of the linear hat
function $\varphi_k$ from (11.4) in the form

\[ \varphi_k(x) = \frac{1}{h}\, S_{(1)}(x - x_{k-1}), \]

which, for $1 \le k \le n$, is nonzero over two consecutive intervals: $(x_{k-1}, x_k]$
and $[x_k, x_{k+1})$.

In the same way we obtain a basis function for the cubic spline by
taking $n = 3$:

\[ S_{(3)}(x) = x^3_+ - 4(x - h)^3_+ + 6(x - 2h)^3_+ - 4(x - 3h)^3_+ + (x - 4h)^3_+. \]

Normalising so as to have a maximum value of 1 and shifting $x = 0$ to
$x = x_{k-2}$, we get

\[ \psi_k(x) = \frac{1}{4h^3}\, S_{(3)}(x - x_{k-2}). \]

For $2 \le k \le n-1$, this function is nonzero over four consecutive intervals
$(x_{k-2}, x_{k-1}]$, $[x_{k-1}, x_k]$, $[x_k, x_{k+1}]$ and $[x_{k+1}, x_{k+2})$, and is illustrated in
Figure 11.6.

Fig. 11.6. Normalised cubic B-spline, $\psi_k(x)$, $2 \le k \le n-1$.
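
The properties of $S_{(n)}$ stated in Theorem 11.5, and the normalisation constants $h$ and $4h^3$ used above, are easy to check numerically. The following is a minimal sketch in Python with NumPy (the function name `S` and the choice $h = 1$ are illustrative assumptions); it evaluates the defining sum of truncated powers directly.

```python
import numpy as np
from math import comb

def S(n, x, h=1.0):
    """Evaluate S_(n)(x) = sum_{k=0}^{n+1} (-1)^k C(n+1, k) (x - k h)_+^n."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for k in range(n + 2):
        # np.maximum(., 0)**n is the positive part (x - k h)_+^n for n >= 1.
        total += (-1)**k * comb(n + 1, k) * np.maximum(x - k * h, 0.0)**n
    return total

h = 1.0
grid = np.linspace(-2.0, 7.0, 901)
print("max of S_(1):", S(1, grid, h).max())   # attained at x = h, value h
print("max of S_(3):", S(3, grid, h).max())   # attained at x = 2h, value 4 h^3
print("S_(3) outside (0, 4h):", S(3, np.array([-1.0, 0.0, 4.0, 6.0]), h))
```

Evaluating the sum for large $x$ involves cancellation between large terms, so in floating-point arithmetic the values outside the support are zero only up to rounding error.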

We see that both $\varphi_k$ and $\psi_k$ are nonnegative for all $x$; this is true for
a spline basis function of any degree $n$, $n \ge 1$, constructed in this way,
but we shall not prove it here (see Exercise 6).

For a finite set of knots $a = x_0 < x_1 < \dots < x_{n+1} = b$ on the bounded
and closed interval $[a,b]$ the normalised linear basis splines $x \mapsto \varphi_0(x)$
and $x \mapsto \varphi_{n+1}(x)$ are considered only for $x$ in $[a,b]$, so as to avoid
reference to nonexisting knots (such as $x_{-1}$ or $x_{n+2}$) that lie outside
$[a,b]$. A similar comment applies to the normalised cubic basis splines
$\psi_0$, $\psi_1$, $\psi_n$ and $\psi_{n+1}$.


11.7 Notes 


There are many excellent texts covering the theory of piecewise polyno- 
mial approximation by splines. For a detailed survey of key results we 
refer to Chapters 18-24 of 


» M.J.D. POWELL, Approximation Theory and Methods, Cambridge
University Press, Cambridge, 1996.


You may have noticed that we have given bounds on the error in linear
spline approximation in Theorem 11.1, and in Hermite cubic spline
approximation in Theorem 11.4, but not for the natural cubic spline. The
analysis of the error in the natural cubic spline approximation is quite
complicated; Powell gives full details in his book.

The following are classical texts on the theory of splines. 


» J.H. AHLBERG, E.N. NILSON, AND J.L. WALSH, The Theory of
Splines and Their Applications, Mathematics in Science and Engineering,
38, Academic Press, New York, 1967.

» C. DE BOOR, A Practical Guide to Splines, Revised Edition, Springer
Applied Mathematical Sciences, 27, Springer, New York, 2001.

» LARRY L. SCHUMAKER, Spline Functions: Basic Theory, John Wiley 
& Sons, New York, 1981. 


The variational characterisations of splines stated in Sections 11.1 and 
11.3 stem from the work of J.C. Holladay, Smoothest curve approxima- 
tion, Math. Comput. 11, 233-248, 1957. 

Our definition of the Sobolev space $H^k(a,b)$ in Section 11.1, based on
the concept of absolute continuity, is specific to functions of a single variable.
More generally, for functions of several real variables one needs to
invoke the theory of weak differentiability or the theory of distributions
to give a rigorous definition of the Sobolev space $H^k(\Omega)$ with $\Omega \subset \mathbb{R}^n$;
alternatively, one can define $H^k(\Omega)$ by completion of the set of smooth
functions in a suitable norm. For the sake of simplicity of exposition we
have chosen to avoid such general approaches.


Exercises 


11.1 An interpolating spline of degree $n$ is required to have continuous
derivatives of order up to and including $n - 1$ at the
knots. How many additional conditions are required to specify
the spline uniquely?

11.2 (i) Suppose that $f$ is a polynomial of degree 1. Show that the
linear spline $s_L$ which interpolates $f$ at the knots $x_i$ for
$i = 0, 1, \dots, m$ is identical to $f$, so that $s_L \equiv f$.

(ii) Suppose that $f$ is a polynomial of degree 3. Show that the
Hermite cubic spline $s_H$ which interpolates $f$ at the knots $x_i$,
$i = 0, 1, \dots, m$, is identical to $f$, so that $s_H \equiv f$.

(iii) Suppose that $f$ is a polynomial of degree 3. Show that
the natural cubic spline $s_2$ which interpolates $f$ at the knots $x_i$,
$i = 0, 1, \dots, m$, is not in general identical to $f$.
11.3 Suppose that the natural cubic spline $s_2$ interpolates the function
$f\colon x \mapsto x^3$ on the interval $[0,1]$, the knots being equally
spaced, so that $x_i = ih$, $i = 0, 1, \dots, m$, with $h = 1/m$, $m \ge 2$.
Write down the equations which determine the quantities $\sigma_i$. If
the two additional conditions are $\sigma_0 = \sigma_m = 0$, show that these
equations are not satisfied by $\sigma_i = f''(x_i)$, $i = 1, \dots, m-1$,
so that $s_2$ and $f$ are not identical. If, however, these two additional
conditions are replaced by $\sigma_0 = f''(0)$, $\sigma_m = f''(1)$, show
that $\sigma_i = f''(x_i)$, $i = 0, 1, \dots, m$, and deduce that $s_2$ and $f$ are
identical.

11.4 A linear spline on the interval $[0,1]$ is expressed in terms of the
basis functions as

\[ s(x) = \sum_{k=0}^{m} \alpha_k \varphi_k(x). \]

Instead of being required to interpolate the function $f$ at the
knots, the spline $s$ is required to minimise $\|f - s\|_2$. Show that
the coefficients $\alpha_k$ satisfy the system of equations

\[ A\alpha = b, \]

where the elements of the matrix $A$ are

\[ A_{ij} = \int_0^1 \varphi_i(x)\varphi_j(x)\,\mathrm{d}x, \]

and the elements of $b$ are

\[ b_i = \int_0^1 f(x)\varphi_i(x)\,\mathrm{d}x. \]

Now suppose that the knots are equally spaced, so that $x_k = kh$,
$k = 0, 1, \dots, m$, where $h = 1/m$, $m \ge 2$. Show that the
matrix $A$ is tridiagonal, with $A_{ii} = \tfrac{2}{3}h$ for $i = 1, \dots, m-1$, and
determine the other nonzero elements of $A$. Show also that $A$
has the properties required for the use of the Thomas algorithm
described in Section 3.3.

11.5 In the notation of Exercise 4, suppose that $f(x) = x$. Verify
that the system of equations is satisfied by $\alpha_k = kh$, so that
$s \equiv f$.

Now suppose that $f(x) = x^2$. Verify that the equations are
satisfied by $\alpha_k = (kh)^2 + Ch^2$, where $C$ is a constant to be
determined. Deduce that $s(x_k) = f(x_k) + Ch^2$.
11.6 In the notation of Theorem 11.5, the spline basis function $S_{(n)}$
of degree $n$ is defined by

\[ S_{(n)}(x) = \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} (x - kh)^n_+. \]

Explain why, for any value of $a$,

\[ (x - a)^n_+\,(x - a) = (x - a)^{n+1}_+. \]

Show that

\[ x\,S_{(n)}(x) + [(n+2)h - x]\,S_{(n)}(x - h) = S_{(n+1)}(x). \]

Hence show by induction that $S_{(n)}(x) \ge 0$ for all $x$.

11.7 Use the result of Exercise 6 to show by induction that each basis
function $S_{(n)}$ is symmetric; that is,

\[ S_{(n)}(p + x) = S_{(n)}(p - x) \]

for all $x$, where $p = \tfrac{1}{2}(n+1)h$.
Initial value problems for ODEs


12.1 Introduction 


Ordinary differential equations frequently occur in mathematical models 
that arise in many branches of science, engineering and economics. Un- 
fortunately it is seldom that these equations have solutions which can be 
expressed in closed form, so it is common to seek approximate solutions 
by means of numerical methods. Nowadays this can usually be achieved 
very inexpensively to high accuracy and with a reliable bound on the er- 
ror between the analytical solution and its numerical approximation. In 
this section we shall be concerned with the construction and the analysis 
of numerical methods for first-order differential equations of the form 


\[ y' = f(x, y) \qquad (12.1) \]

for the real-valued function $y$ of the real variable $x$, where $y' \equiv \mathrm{d}y/\mathrm{d}x$
and $f$ is a given real-valued function of two real variables. In order to
select a particular integral from the infinite family of solution curves
that constitute the general solution to (12.1), the differential equation
will be considered in tandem with an initial condition: given two real
numbers $x_0$ and $y_0$, we seek a solution to (12.1) for $x > x_0$ such that

\[ y(x_0) = y_0. \qquad (12.2) \]


The differential equation (12.1) together with the initial condition (12.2) 
is called an initial value problem. 

If you believe that any initial value problem of the form (12.1), (12.2) 
possesses a unique solution, take a look at the following example. 




Example 12.1 Consider the differential equation $y' = |y|^\alpha$, subject to
the initial condition $y(0) = 0$, where $\alpha$ is a fixed real number, $\alpha \in (0,1)$.

It is a simple matter to verify that, for any nonnegative real number $c$,

\[ y_c(x) = \begin{cases}
   (1 - \alpha)^{\frac{1}{1-\alpha}} (x - c)^{\frac{1}{1-\alpha}}, & c \le x < \infty, \\
   0, & 0 \le x \le c,
   \end{cases} \]

is a solution to the initial value problem on the interval $[0, \infty)$. Consequently
the existence of the solution is ensured, but not its uniqueness;
in fact, the initial value problem has an infinite family of solutions $\{y_c\}$,
parametrised by $c \ge 0$.

We note in passing that in contrast with the case of $\alpha \in (0,1)$, when
$\alpha \ge 1$, the initial value problem $y' = |y|^\alpha$, $y(0) = 0$ has the unique
solution $y(x) \equiv 0$. ◇
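
The non-uniqueness in Example 12.1 can be observed numerically. The sketch below (Python with NumPy; the choice $\alpha = 1/2$, the values of $c$ and the grid are illustrative assumptions) evaluates $y_c$ for several $c$ and measures the residual $y_c' - |y_c|^\alpha$ using finite differences; each $y_c$ satisfies the same initial condition $y_c(0) = 0$.

```python
import numpy as np

def y_c(x, c, alpha):
    """The solution y_c of y' = |y|^alpha, y(0) = 0, from Example 12.1."""
    x = np.asarray(x, dtype=float)
    p = 1.0 / (1.0 - alpha)
    # Zero on [0, c], then ((1 - alpha)(x - c))^(1/(1 - alpha)) for x >= c.
    return ((1.0 - alpha) * np.maximum(x - c, 0.0))**p

alpha = 0.5
x = np.linspace(0.0, 3.0, 3001)
residuals = {}
for c in (0.0, 0.5, 1.0):
    y = y_c(x, c, alpha)
    dy = np.gradient(y, x)                       # finite-difference derivative
    # Ignore the two end points, where one-sided differences are used.
    residuals[c] = np.max(np.abs(dy - np.abs(y)**alpha)[1:-1])
    print(f"c = {c}: y_c(0) = {y[0]}, max residual = {residuals[c]:.2e}")
```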


Example 12.1 indicates that the function f has to obey a certain 
growth condition with respect to its second argument so as to ensure 
that (12.1), (12.2) has a unique solution. The precise hypotheses on 
f guaranteeing the existence of a unique solution to the initial value 
problem (12.1), (12.2) are stated in the next theorem. 


Theorem 12.1 (Picard's Theorem¹) Suppose that the real-valued
function $(x,y) \mapsto f(x,y)$ is continuous in the rectangular region $D$ defined
by $x_0 \le x \le X_M$, $y_0 - C \le y \le y_0 + C$; that $|f(x, y_0)| \le K$ when
$x_0 \le x \le X_M$; and that $f$ satisfies the Lipschitz condition: there exists
$L > 0$ such that

\[ |f(x,u) - f(x,v)| \le L|u - v| \qquad \text{for all } (x,u) \in D,\ (x,v) \in D. \]

Assume further that

\[ C \ge \frac{K}{L} \left( e^{L(X_M - x_0)} - 1 \right). \qquad (12.3) \]

Then, there exists a unique function $y \in C^1[x_0, X_M]$ such that $y(x_0) = y_0$
and $y' = f(x,y)$ for $x \in [x_0, X_M]$; moreover,

\[ |y(x) - y_0| \le C, \qquad x_0 \le x \le X_M. \]


1 Charles Emile Picard (24 July 1856, Paris, France — 11 December 1941, Paris, 
France). Although as a child he was a brilliant pupil, Picard disliked mathemat- 
ics and only became interested in the subject during the vacation following his 
secondary studies. He was appointed to the chair of differential calculus at the 
Sorbonne in Paris at the age of 29 but could only take up his position a year later, 
as university regulations prevented anyone below the age of 30 holding a chair. 
Picard made important contributions to mathematical analysis and the theory of 
differential equations. 
Proof We define a sequence of functions $(y_n)_{n=0}^\infty$ by

\[ y_0(x) \equiv y_0, \]
\[ y_n(x) = y_0 + \int_{x_0}^{x} f(s, y_{n-1}(s))\,\mathrm{d}s, \qquad n = 1, 2, \dots. \qquad (12.4) \]

Since $f$ is continuous on $D$, it is clear that each function $y_n$ is continuous
on $[x_0, X_M]$. Further, since

\[ y_{n+1}(x) = y_0 + \int_{x_0}^{x} f(s, y_n(s))\,\mathrm{d}s, \]

it follows by subtraction that

\[ y_{n+1}(x) - y_n(x) = \int_{x_0}^{x} \left[ f(s, y_n(s)) - f(s, y_{n-1}(s)) \right] \mathrm{d}s. \qquad (12.5) \]


We now proceed by induction, and assume that, for some positive
value of $n$,

\[ |y_n(x) - y_{n-1}(x)| \le \frac{K}{L}\,\frac{[L(x - x_0)]^n}{n!}, \qquad x_0 \le x \le X_M, \qquad (12.6) \]

and that

\[ |y_k(x) - y_0| \le \frac{K}{L} \sum_{j=1}^{k} \frac{[L(x - x_0)]^j}{j!},
   \qquad x_0 \le x \le X_M, \quad k = 1, \dots, n. \qquad (12.7) \]

Trivially, the hypotheses of the theorem and (12.4) imply that (12.6)
and (12.7) hold for $n = 1$.
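Now, (12.7) and (12.3) yield that

\[ |y_k(x) - y_0| \le \frac{K}{L} \left( e^{L(X_M - x_0)} - 1 \right) \le C,
   \qquad x_0 \le x \le X_M, \quad k = 1, \dots, n. \]

Therefore $(x, y_{n-1}(x)) \in D$ and $(x, y_n(x)) \in D$ for all $x \in [x_0, X_M]$.
Hence, using (12.5), the Lipschitz condition and (12.6),

\[ |y_{n+1}(x) - y_n(x)| \le L \int_{x_0}^{x} \frac{K}{L}\,\frac{[L(s - x_0)]^n}{n!}\,\mathrm{d}s
   = \frac{K}{L}\,\frac{[L(x - x_0)]^{n+1}}{(n+1)!} \qquad (12.8) \]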
for all $x \in [x_0, X_M]$. Moreover, using (12.8) and (12.7),

\[ |y_{n+1}(x) - y_0| \le |y_{n+1}(x) - y_n(x)| + |y_n(x) - y_0| \]
\[ \le \frac{K}{L}\,\frac{[L(x - x_0)]^{n+1}}{(n+1)!} + \frac{K}{L} \sum_{j=1}^{n} \frac{[L(x - x_0)]^j}{j!} \]
\[ = \frac{K}{L} \sum_{j=1}^{n+1} \frac{[L(x - x_0)]^j}{j!} \qquad (12.9) \]

for all $x \in [x_0, X_M]$. Thus, (12.6) and (12.7) hold with $n$ replaced by
$n+1$, and hence, by induction, they hold for all positive integers $n$.

Since the infinite series $\sum_{j=1}^{\infty} (c^j / j!)$ converges (to $e^c - 1$) for any value
of $c \in \mathbb{R}$, and for $c = L(X_M - x_0)$ in particular, it follows from (12.6)
that the infinite series

\[ \sum_{j=1}^{\infty} [y_j(x) - y_{j-1}(x)] \]

converges absolutely and uniformly for $x \in [x_0, X_M]$. However,

\[ \sum_{j=1}^{n} [y_j(x) - y_{j-1}(x)] = y_n(x) - y_0, \]

showing that the sequence of continuous functions $(y_n)$ converges to a
limit, uniformly on $[x_0, X_M]$, and hence that the limit itself is a continuous
function. Calling this limit $y$, we see from (12.4) that

\[ y(x) = \lim_{n \to \infty} y_{n+1}(x) \]
\[ = y_0 + \lim_{n \to \infty} \int_{x_0}^{x} f(s, y_n(s))\,\mathrm{d}s \]
\[ = y_0 + \int_{x_0}^{x} \lim_{n \to \infty} f(s, y_n(s))\,\mathrm{d}s \]
\[ = y_0 + \int_{x_0}^{x} f(s, y(s))\,\mathrm{d}s, \qquad (12.10) \]

where we used the uniform convergence of the sequence of functions $(y_n)$
in the transition from line two to line three to interchange the order of
the limit process and integration, and the continuity of the function $f$
in the transition from line three to line four. As $s \mapsto f(s, y(s))$ is a
continuous function of $s$ on the interval $[x_0, X_M]$, its integral over the
interval $[x_0, x]$ is a continuously differentiable function of $x$. Hence, by
(12.10), $y$ is a continuously differentiable function of $x$ on $[x_0, X_M]$; i.e.,
$y \in C^1[x_0, X_M]$. On differentiating (12.10) we deduce that

\[ y' = f(x, y), \]

as required; also $y(x_0) = y_0$. We have already seen that $(x, y_n(x)) \in D$
when $x_0 \le x \le X_M$; as $D$ is a closed set in $\mathbb{R}^2$, on letting $n \to \infty$ it
then follows that also $(x, y(x)) \in D$ when $x_0 \le x \le X_M$.

To show that the solution of the initial value problem is unique, suppose,
if possible, that there are two different solutions $y$ and $z$. Then,
by subtraction,

\[ y(x) - z(x) = \int_{x_0}^{x} \left[ f(s, y(s)) - f(s, z(s)) \right] \mathrm{d}s, \qquad x_0 \le x \le X_M, \]

from which it follows that

\[ |y(x) - z(x)| \le L \int_{x_0}^{x} |y(s) - z(s)|\,\mathrm{d}s \qquad (12.11) \]

for all $x \in [x_0, X_M]$. Suppose that $m$ is the maximum value of the
expression $|y(x) - z(x)|$ for $x_0 \le x \le X_M$, and that $m > 0$. Then,

\[ |y(x) - z(x)| \le mL(x - x_0), \qquad x_0 \le x \le X_M. \]

Substituting this inequality into the right-hand side of (12.11) we find

\[ |y(x) - z(x)| \le L^2 m \int_{x_0}^{x} (s - x_0)\,\mathrm{d}s = m\,\frac{[L(x - x_0)]^2}{2!}. \]

Proceeding in a similar manner, it is easy to show by induction that

\[ |y(x) - z(x)| \le m\,\frac{[L(x - x_0)]^k}{k!}, \qquad k = 1, 2, \dots, \]

for all $x \in [x_0, X_M]$. However, the right-hand side in the last inequality
is bounded above by $m[L(X_M - x_0)]^k / k!$ for all $x \in [x_0, X_M]$, which can
be made arbitrarily small by choosing $k$ sufficiently large. Therefore,
$|y(x) - z(x)|$ must be zero for all $x \in [x_0, X_M]$. Hence the solutions $y$
and $z$ are identical.


In an application of this theorem it is necessary to choose a value of
the constant $C$ in Picard's Theorem so that the various hypotheses are
satisfied, in particular (12.3); it is not difficult to see that if $\partial f / \partial y$ is
continuous in a neighbourhood of $(x_0, y_0)$ the conditions will be satisfied
if $X_M - x_0$ is sufficiently small.
As a very simple example, consider the linear equation

\[ y' = py + q, \qquad (12.12) \]

where $p$ and $q$ are constants. Then, $L = |p|$, independently of $C$, and
$K = |py_0| + |q|$. Hence, for any interval $[x_0, X_M]$, the conditions are
satisfied by choosing $C$ sufficiently large; therefore, the initial value
problem has a unique continuously differentiable solution, defined for
all $x \in [x_0, \infty)$.

Now, consider another example

\[ y' = y^2, \qquad y(0) = 1. \]

Here for any interval $[0, X_M]$ we have $K = 1$. Choosing any positive
value of $C$ we find that

\[ |u^2 - v^2| = |u + v|\,|u - v| \le L|u - v| \qquad \forall u, v \in [1 - C, 1 + C], \]

where $L = 2(1 + C)$. We therefore now require the condition

\[ C \ge \frac{1}{2(1 + C)} \left( e^{2(1+C)X_M} - 1 \right). \]

This is satisfied if

\[ X_M \le F(C) \equiv \frac{1}{2(1 + C)} \ln(1 + 2C + 2C^2), \]

where $\ln$ means $\log_e$. A sketch of the graph of the function $F$ against $C$
shows that $F$ takes its maximum value near $C = 1.714$, and this gives
the condition $X_M \le 0.43$ (see Figure 12.1).
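
The location and size of the maximum of $F$ can be confirmed with a simple grid search. The sketch below (Python with NumPy; the grid on $(0, 4]$ is an illustrative assumption matching the range of Figure 12.1) evaluates $F(C) = \ln(1 + 2C + 2C^2) / (2(1 + C))$ and picks out the largest value.

```python
import numpy as np

def F(C):
    """Upper bound on X_M from Picard's Theorem for y' = y^2, y(0) = 1."""
    return np.log(1.0 + 2.0 * C + 2.0 * C**2) / (2.0 * (1.0 + C))

C = np.linspace(0.01, 4.0, 40000)
values = F(C)
i = np.argmax(values)
C_best, F_best = C[i], values[i]
print(f"F attains its maximum near C = {C_best:.3f}, where F(C) = {F_best:.4f}")
```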

Thus, we are unable to prove the existence of the solution over the
infinite interval $[0, \infty)$. This is correct, of course, as the unique solution
of the initial value problem is

\[ y(x) = \frac{1}{1 - x}, \qquad 0 \le x < 1, \]

and this is not continuous, let alone continuously differentiable, on any
interval $[0, X_M]$ with $X_M \ge 1$. The conditions of Picard's Theorem,
which are sufficient but not necessary for the existence and the uniqueness
of the solution, have given a rather more restrictive bound on the
size of the interval over which the solution exists.

The method of proof of Picard's Theorem also suggests a possible
technique for constructing approximations to the solution, by determining
the functions $y_n$ from (12.4). In practice it may be impossible, or
very difficult, to evaluate the necessary integrals in closed form. We
leave it as an exercise (see Exercise 3) to show that for the simple linear
equation (12.12), with initial condition $y(0) = 1$, the function $y_n$ is the
same as the approximation obtained from the exact solution by expanding
the exponential function as a power series and retaining the terms
up to the one involving $x^n$.

Fig. 12.1. Graph of the function $C \mapsto F(C)$ on the interval $[0,4]$; $F$ achieves
its maximum value near $C = 1.714$ and $F(C) \le 0.43$ for all $C \ge 0$.
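
For the linear equation (12.12) the Picard iterates (12.4) are polynomials, so the iteration can be carried out exactly on coefficient lists. The following is a minimal sketch in plain Python (the function name `picard_linear` and the general starting value `y0` at $x_0 = 0$ are illustrative assumptions); each pass forms $p\,y_{n-1}(s) + q$ and integrates it term by term from 0 to $x$.

```python
from math import factorial

def picard_linear(p, q, y0, n):
    """Coefficients (lowest degree first) of the Picard iterate y_n
    for y' = p*y + q, y(0) = y0."""
    coeffs = [y0]                                  # y_0(x) = y0
    for _ in range(n):
        integrand = [p * c for c in coeffs]        # p * y_{n-1}(s)
        integrand[0] += q                          # ... + q
        # Integrate term by term from 0 to x and add the initial value.
        coeffs = [y0] + [c / (j + 1) for j, c in enumerate(integrand)]
    return coeffs

# With p = 1, q = 0, y(0) = 1 the iterates are the partial sums of exp(x).
y4 = picard_linear(1.0, 0.0, 1.0, 4)
print(y4)
print([1.0 / factorial(j) for j in range(5)])
```

The two printed lists agree, which illustrates (but of course does not prove) the statement made above about the truncated exponential series.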

In the rest of this chapter we shall consider step-by-step numerical
methods for the approximate solution of the initial value problem (12.1),
(12.2). We shall suppose throughout that the function $f$ satisfies the
conditions of Picard's Theorem. Suppose that the initial value problem
(12.1), (12.2) is to be solved on the interval $[x_0, X_M]$. We divide this
interval by the mesh points $x_n = x_0 + nh$, $n = 0, 1, \dots, N$, where
$h = (X_M - x_0)/N$ and $N$ is a positive integer. The positive real number
$h$ is called the step size or mesh size. For each $n$ we seek a numerical
approximation $y_n$ to $y(x_n)$, the value of the analytical solution at the
mesh point $x_n$; these values $y_n$ are calculated in succession, for
$n = 1, 2, \dots, N$.
12.2 One-step methods


A one-step method expresses y,+1 in terms of the previous value y,; 
later on we shall consider k-step methods, where y,+1 is expressed in 
terms of the k previous values yn_—%41,---;Yn, Where k > 2. The simplest 
example of a one-step method for the numerical solution of the initial 
value problem (12.1), (12.2) is Euler’s method. 

Euler’s method. Given that $y(x_0) = y_0$, let us suppose that we have
already calculated $y_n$, up to some $n$, $0 \le n \le N-1$, $N \ge 1$; we define

$$ y_{n+1} = y_n + h f(x_n, y_n) . $$

Thus, taking in succession $n = 0, 1, \ldots, N-1$, one step at a time, the
approximate values $y_n$ at the mesh points $x_n$ can be easily obtained.
This numerical method is known as Euler’s method.
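The recursion above translates almost line by line into code. The following is a minimal sketch, not a production solver; the function name `euler`, its argument names and the test problem $y' = y$, $y(0) = 1$ are our own illustrative choices.

```python
from typing import Callable, List, Tuple


def euler(f: Callable[[float, float], float],
          x0: float, y0: float, x_end: float, n_steps: int
          ) -> Tuple[List[float], List[float]]:
    """Approximate y' = f(x, y), y(x0) = y0 on [x0, x_end] by Euler's method."""
    h = (x_end - x0) / n_steps          # step size h = (X_M - x_0) / N
    xs, ys = [x0], [y0]
    for n in range(n_steps):
        # y_{n+1} = y_n + h f(x_n, y_n)
        ys.append(ys[-1] + h * f(xs[-1], ys[-1]))
        xs.append(x0 + (n + 1) * h)     # x_{n+1} = x_0 + (n + 1) h
    return xs, ys


if __name__ == "__main__":
    # Illustrative test problem (our choice): y' = y, y(0) = 1 on [0, 1].
    xs, ys = euler(lambda x, y: y, 0.0, 1.0, 1.0, 10)
    print(xs[-1], ys[-1])
```

For this linear test problem each step multiplies $y_n$ by $1 + h$, so the value returned at $x = 1$ is $(1 + h)^N$, which tends to $e$ as $N$ grows.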

In order to motivate the definition of Euler’s method, let us observe
that on expanding $y(x_{n+1}) = y(x_n + h)$ into a Taylor series about $x_n$,
retaining only the first two terms, and writing $y'(x_n) = f(x_n, y(x_n))$, we
have that

$$ y(x_n + h) = y(x_n) + h f(x_n, y(x_n)) + O(h^2) . $$

After replacing $y(x_n)$ and $y(x_n + h)$ by their numerical approximations,
denoted by $y_n$ and $y_{n+1}$, respectively, and discarding the $O(h^2)$ term,
we arrive at Euler’s method.

More generally, a one-step method may be written in the form

$$ y_{n+1} = y_n + h\Phi(x_n, y_n; h), \quad n = 0, 1, \ldots, N-1, \quad y(x_0) = y_0, \tag{12.13} $$

where $\Phi(\cdot, \cdot\,; \cdot)$ is a continuous function of its variables. For example,
in the case of Euler’s method, $\Phi(x_n, y_n; h) = f(x_n, y_n)$. More intricate
examples of one-step methods will be discussed below.

In order to assess the accuracy of the numerical method (12.13), we
define the global error, $e_n$, by

$$ e_n = y(x_n) - y_n . $$

We also need the concept of truncation error, $T_n$, defined by

$$ T_n = \frac{y(x_{n+1}) - y(x_n)}{h} - \Phi(x_n, y(x_n); h) . \tag{12.14} $$

The next theorem provides a bound on the magnitude of the global
error in terms of the truncation error.

Theorem 12.2 Consider the general one-step method (12.13) where, in
addition to being a continuous function of its arguments, $\Phi$ is assumed
to satisfy a Lipschitz condition with respect to its second argument, that
is, there exists a positive constant $L_\Phi$ such that, for $0 \le h \le h_0$ and for
all $(x,u)$ and $(x,v)$ in the rectangle

$$ D = \{(x,y) : x_0 \le x \le X_M, \ |y - y_0| \le C\} , $$

we have that

$$ |\Phi(x,u;h) - \Phi(x,v;h)| \le L_\Phi |u - v| . \tag{12.15} $$

Then, assuming that $|y_n - y_0| \le C$, $n = 1, 2, \ldots, N$, it follows that

$$ |e_n| \le \frac{T}{L_\Phi}\left(e^{L_\Phi (x_n - x_0)} - 1\right), \quad n = 0, 1, \ldots, N, \tag{12.16} $$

where $T = \max_{0 \le n \le N-1} |T_n|$.


Proof Rewriting (12.14) as

$$ y(x_{n+1}) = y(x_n) + h\Phi(x_n, y(x_n); h) + hT_n $$

and subtracting (12.13) from this, we obtain

$$ e_{n+1} = e_n + h[\Phi(x_n, y(x_n); h) - \Phi(x_n, y_n; h)] + hT_n . $$

Then, since $(x_n, y(x_n))$ and $(x_n, y_n)$ belong to $D$, the Lipschitz condition
(12.15) implies that

$$ |e_{n+1}| \le |e_n| + hL_\Phi |e_n| + h|T_n| , \quad n = 0, 1, \ldots, N-1. \tag{12.17} $$

That is,

$$ |e_{n+1}| \le (1 + hL_\Phi)|e_n| + h|T_n| , \quad n = 0, 1, \ldots, N-1. $$

It easily follows by induction that

$$ |e_n| \le \frac{T}{L_\Phi}\left[(1 + hL_\Phi)^n - 1\right], \quad n = 0, 1, \ldots, N, $$

since $e_0 = 0$. Observing that $1 + hL_\Phi \le \exp(hL_\Phi)$ gives (12.16).
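The induction step is short enough to write out. The following sketch uses only (12.17) and the bound $|T_n| \le T$, and then records why the exponential in (12.16) involves $x_n - x_0$.

```latex
% Induction hypothesis: |e_n| <= (T / L_\Phi) [ (1 + h L_\Phi)^n - 1 ].
\begin{align*}
|e_{n+1}| &\le (1 + hL_\Phi)\,|e_n| + hT \\
          &\le \frac{T}{L_\Phi}\bigl[(1 + hL_\Phi)^{n+1} - (1 + hL_\Phi)\bigr] + hT \\
          &= \frac{T}{L_\Phi}\bigl[(1 + hL_\Phi)^{n+1} - 1\bigr] .
\end{align*}
% Passage to (12.16), using x_n = x_0 + n h:
\[
(1 + hL_\Phi)^n \le \exp(n h L_\Phi) = \exp\bigl(L_\Phi (x_n - x_0)\bigr) .
\]
```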


Let us apply this general result in order to obtain a bound on the
global error in Euler’s method. The truncation error for Euler’s method
is given by

$$ T_n = \frac{y(x_{n+1}) - y(x_n)}{h} - f(x_n, y(x_n)) = \frac{y(x_{n+1}) - y(x_n)}{h} - y'(x_n) . \tag{12.18} $$

Assuming that $y \in C^2[x_0, X_M]$, i.e., that $y$ is a twice continuously
differentiable function of $x$ on $[x_0, X_M]$, and expanding $y(x_{n+1})$ about the
point $x_n$ into a Taylor series with remainder (see Theorem A.4), we have
that

$$ y(x_{n+1}) = y(x_n) + h y'(x_n) + \frac{h^2}{2!} y''(\xi_n) , \quad x_n < \xi_n < x_{n+1} . $$

Substituting this expansion into (12.18) gives

$$ T_n = \tfrac{1}{2} h y''(\xi_n) . $$

Let $M_2 = \max_{\zeta \in [x_0, X_M]} |y''(\zeta)|$. Then, $|T_n| \le T$, $n = 0, 1, \ldots, N-1$,
where $T = \frac{1}{2} h M_2$. Inserting this into (12.16) and noting that for Euler’s
method $\Phi(x_n, y_n; h) = f(x_n, y_n)$ and therefore $L_\Phi = L$ where $L$ is the
Lipschitz constant for $f$, we have that

$$ |e_n| \le \tfrac{1}{2} M_2 \left[\frac{e^{L(x_n - x_0)} - 1}{L}\right] h , \quad n = 0, 1, \ldots, N. \tag{12.19} $$

Let us highlight the practical relevance of our error analysis by
focusing on a particular example.


Example 12.2 Let us consider the initial value problem $y' = \tan^{-1} y$,
$y(0) = y_0$, where $y_0$ is a given real number. In order to find an upper
bound on the global error $e_n = y(x_n) - y_n$, where $y_n$ is the Euler
approximation to $y(x_n)$, we need to determine the constants $L$ and $M_2$ in the
inequality (12.19).

Here $f(x,y) = \tan^{-1} y$; so, by the Mean Value Theorem (Theorem A.3),

$$ |f(x,u) - f(x,v)| = \left|\frac{\partial f}{\partial y}(x,\eta)\right| \, |u - v| , $$

where $\eta$ lies between $u$ and $v$. In our case

$$ \left|\frac{\partial f}{\partial y}\right| = |(1 + y^2)^{-1}| \le 1 , $$

and therefore $L = 1$. To find $M_2$ we need to obtain a bound on $|y''|$ (without
actually solving the initial value problem!). This is easily achieved
by differentiating both sides of the differential equation with respect to
the variable $x$:

$$ y'' = \frac{d}{dx}(\tan^{-1} y) = (1 + y^2)^{-1} \frac{dy}{dx} = (1 + y^2)^{-1} \tan^{-1} y . $$

Therefore $|y''(x)| \le M_2 = \frac{1}{2}\pi$. Inserting the values of $L$ and $M_2$ into
(12.19) and noting that $x_0 = 0$, we have

$$ |e_n| \le \tfrac{1}{4}\pi (e^{x_n} - 1) h , \quad n = 0, 1, \ldots, N. $$

Thus, given a tolerance TOL, specified beforehand, we can ensure that
the error between the (unknown) analytical solution and its numerical
approximation does not exceed this tolerance by choosing a positive step
size $h$ such that

$$ h \le \frac{4}{\pi (e^{X_M} - 1)} \, \mathrm{TOL} . $$

For such $h$ we shall have $|y(x_n) - y_n| = |e_n| \le \mathrm{TOL}$, for $n = 0, 1, \ldots, N$,
as required. Thus, at least in principle, we can calculate the numerical
solution to arbitrarily high accuracy by choosing a sufficiently small step
size $h$.

A numerical experiment shows that this error estimate is rather
pessimistic. Taking, for example, $y_0 = 1$ and $X_M = 1$, our bound implies
that the tolerance TOL = 0.01 will be achieved with $h \le 0.0074$; hence,
it would appear that we need $N \ge 135$. In fact, using $N = 27$ gives
a result from Euler’s method which is just within this tolerance, so the
error estimate has predicted the use of a step size which is five times
smaller than is actually required. $\diamond$
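The experiment described in this example can be repeated along the following lines. This is a sketch under stated assumptions: the exact solution is not known in closed form, so the reference value is computed with the classical fourth-order Runge-Kutta method on a fine mesh (our choice of reference, not the text's), and all function names are hypothetical.

```python
import math


def f(x, y):
    """Right-hand side of Example 12.2: y' = arctan(y)."""
    return math.atan(y)


def euler_final(x0, y0, x_end, n_steps):
    """Euler approximation to y(x_end) with n_steps equal steps."""
    h = (x_end - x0) / n_steps
    y = y0
    for n in range(n_steps):
        y += h * f(x0 + n * h, y)
    return y


def rk4_final(x0, y0, x_end, n_steps):
    """Reference value only: classical fourth-order Runge-Kutta on a fine mesh."""
    h = (x_end - x0) / n_steps
    y = y0
    for n in range(n_steps):
        x = x0 + n * h
        k1 = f(x, y)
        k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
        k4 = f(x + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


def step_size_bound(tol, x_m):
    """Largest h for which the bound (pi/4)(e^{x_m} - 1) h does not exceed tol."""
    return 4.0 * tol / (math.pi * (math.exp(x_m) - 1.0))


if __name__ == "__main__":
    tol, x_m = 0.01, 1.0
    h_max = step_size_bound(tol, x_m)
    n_bound = math.ceil(x_m / h_max)       # number of steps the bound asks for
    reference = rk4_final(0.0, 1.0, x_m, 10_000)
    print("h_max =", h_max, "N from bound =", n_bound)
    for n_steps in (27, n_bound):
        print(n_steps, abs(reference - euler_final(0.0, 1.0, x_m, n_steps)))
```

Comparing the two printed errors with the tolerance shows how much smaller a step the a priori bound demands than the computation actually needs.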


Example 12.3 As a more typical practical example, consider the problem

$$ y' = y^2 - g(x) , \quad y(0) = 2 , \tag{12.20} $$

where

$$ g(x) = \frac{x^4 - 6x^3 + 12x^2 - 14x + 9}{(1+x)^2} $$

is so chosen that the solution is known, and is

$$ y(x) = \frac{(1-x)(2-x)}{1+x} . $$

The results of some numerical calculations on the interval $x \in [0, 1.6]$ are
shown in Figure 12.2. They use step sizes 0.2, 0.1 and 0.05, and show
how halving the step size gives a reduction of the error also by a factor
of roughly 2, in agreement with the error bound (12.19). $\diamond$
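The halving of the error can be checked with a few lines of code. In the sketch below the right-hand side is coded as $y^2 - g(x)$, which is the sign for which the stated exact solution satisfies the differential equation (this can be confirmed by differentiating it); the function names are hypothetical, and the error is measured at the end point $x = 1.6$ only.

```python
def g(x):
    """The function g of Example 12.3."""
    return (x**4 - 6.0 * x**3 + 12.0 * x**2 - 14.0 * x + 9.0) / (1.0 + x) ** 2


def f(x, y):
    # Sign chosen so that exact(x) below satisfies y' = f(x, y); see the lead-in.
    return y * y - g(x)


def exact(x):
    """Stated exact solution y(x) = (1 - x)(2 - x) / (1 + x)."""
    return (1.0 - x) * (2.0 - x) / (1.0 + x)


def euler_error_at_end(n_steps, x_end=1.6):
    """|y(x_end) - y_N| for Euler's method with N = n_steps, started from y(0) = 2."""
    h = x_end / n_steps
    y = 2.0
    for n in range(n_steps):
        y += h * f(n * h, y)
    return abs(exact(x_end) - y)


if __name__ == "__main__":
    for n_steps in (8, 16, 32):            # step sizes 0.2, 0.1 and 0.05
        print(n_steps, euler_error_at_end(n_steps))
```

On finer meshes the ratio of successive errors settles near 2, which is the first-order behaviour predicted by (12.19).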


Fig. 12.2. Euler’s method for the solution of (12.20). The exact solution (solid
curve) and three sets of results are shown (large, medium and small dots),
using respectively 8 steps of size 0.2, 16 steps of size 0.1 and 32 steps of size
0.05 on the interval $[0, 1.6]$.


12.3 Consistency and convergence 


Returning to the general one-step method (12.13), we consider the choice
of the function $\Phi$. Theorem 12.2 suggests that if the truncation error
‘approaches zero’ as $h \to 0$, then the global error ‘converges to zero’
also. This observation motivates the following definition.


Definition 12.1 The numerical method (12.13) is consistent with the
differential equation (12.1) if the truncation error, defined by (12.14), is
such that for any $\varepsilon > 0$ there exists a positive $h(\varepsilon)$ for which $|T_n| < \varepsilon$
for $0 < h < h(\varepsilon)$ and any pair of points $(x_n, y(x_n))$, $(x_{n+1}, y(x_{n+1}))$ on
any solution curve in $D$.


For the general one-step method (12.13) we have assumed that the
function $\Phi(\cdot,\cdot\,;\cdot)$ is continuous; since $y'$ is also a continuous function
on $[x_0, X_M]$ it follows from (12.14) that, in the limit of

$$ h \to 0 \ \text{and} \ n \to \infty, \ \text{with} \ \lim_{n \to \infty} x_n = x \in [x_0, X_M] , $$

we have

$$ \lim_{n \to \infty} T_n = y'(x) - \Phi(x, y(x); 0) . $$

In this limit $h$ tends to zero and $n$ tends to infinity in such a way that $x_n$
tends to a limit point $x$ which lies in the interval $[x_0, X_M]$. This implies
that the one-step method (12.13) is consistent if, and only if,

$$ \Phi(x, y; 0) = f(x, y) . \tag{12.21} $$

This condition is sometimes taken as the definition of consistency. We
shall henceforth always assume that (12.21) holds.

Now, we are ready to state a convergence theorem for the general 
one-step method (12.13). 


Theorem 12.3 Suppose that the initial value problem (12.1), (12.2)
satisfies the conditions of Picard’s Theorem, and also that its approximation
generated from (12.13) when $h \le h_0$ lies in the region $D$. Assume
further that the function $\Phi(\cdot,\cdot\,;\cdot)$ is continuous on $D \times [0, h_0]$, and satisfies
the consistency condition (12.21) and the Lipschitz condition

$$ |\Phi(x,u;h) - \Phi(x,v;h)| \le L_\Phi |u - v| \quad \text{on } D \times [0, h_0] . \tag{12.22} $$

Then, if successive approximation sequences $(y_n)$, generated by using the
mesh points $x_n = x_0 + nh$, $n = 1, 2, \ldots, N$, are obtained from (12.13)
with successively smaller values of $h$, each $h$ less than $h_0$, we have
convergence of the numerical solution to the solution of the initial value
problem in the sense that

$$ \lim y_n = y(x) \quad \text{as } x_n \to x \in [x_0, X_M] \text{ when } h \to 0 \text{ and } n \to \infty . $$


Proof Suppose that $h = (X_M - x_0)/N$, where $N$ is a positive integer. We
shall assume that $N$ is sufficiently large so that $h \le h_0$. Since $y(x_0) = y_0$
and therefore $e_0 = 0$, Theorem 12.2 implies that

$$ |y(x_n) - y_n| \le \left(\frac{e^{L_\Phi (X_M - x_0)} - 1}{L_\Phi}\right) \max_{0 \le m \le n-1} |T_m| , \quad n = 1, 2, \ldots, N. \tag{12.23} $$

From the consistency condition (12.21) we have

$$ T_n = \left[\frac{y(x_{n+1}) - y(x_n)}{h} - f(x_n, y(x_n))\right] + \left[\Phi(x_n, y(x_n); 0) - \Phi(x_n, y(x_n); h)\right] . $$

According to the Mean Value Theorem, Theorem A.3, the expression
in the first bracket is equal to $y'(\xi_n) - y'(x_n)$, where $\xi_n \in [x_n, x_{n+1}]$.
By Picard’s Theorem, $y'$ is continuous on the closed interval $[x_0, X_M]$;
therefore, it is uniformly continuous on this interval. Hence, for each
$\varepsilon > 0$ there exists $h_1(\varepsilon)$ such that

$$ |y'(\xi_n) - y'(x_n)| \le \tfrac{1}{2}\varepsilon \quad \text{for } h < h_1(\varepsilon), \quad n = 0, 1, \ldots, N-1. $$

Also, since $\Phi(\cdot,\cdot\,;\cdot)$ is a continuous function on the closed set $D \times [0, h_0]$
and is, therefore, uniformly continuous on $D \times [0, h_0]$, there exists $h_2(\varepsilon)$
such that

$$ |\Phi(x_n, y(x_n); 0) - \Phi(x_n, y(x_n); h)| \le \tfrac{1}{2}\varepsilon $$

for $h < h_2(\varepsilon)$, $n = 0, 1, \ldots, N-1$. On defining $h(\varepsilon) = \min\{h_1(\varepsilon), h_2(\varepsilon)\}$,
we then have that

$$ |T_n| \le \varepsilon \quad \text{for } h < h(\varepsilon), \quad n = 0, 1, \ldots, N-1. $$

Inserting this into (12.23) we deduce that

$$ |y(x) - y_n| \le |y(x) - y(x_n)| + |y(x_n) - y_n| \le |y(x) - y(x_n)| + \varepsilon \, \frac{e^{L_\Phi (X_M - x_0)} - 1}{L_\Phi} . \tag{12.25} $$

Now, in the limit of $h \to 0$, $n \to \infty$ with $x_n \to x \in [x_0, X_M]$, we have
$\lim_{n \to \infty} y(x_n) = y(x)$, since $y$ is a continuous function on $[x_0, X_M]$.
Further, the second term on the right-hand side of (12.25) can be made
arbitrarily small, independently of $h$ and $n$, by letting $\varepsilon \to 0$. Therefore,
in the limit of $h \to 0$, $n \to \infty$ with $x_n \to x \in [x_0, X_M]$, we have that
$\lim_{n \to \infty} y_n = y(x)$, as stated.


We saw earlier that for Euler’s method the magnitude of the truncation
error $T_n$ is bounded above by a constant multiple of the step size
$h$, that is,

$$ |T_n| \le K h \quad \text{for } 0 < h \le h_0 , $$

where $K$ is a positive constant, independent of $h$. However, there are
other one-step methods (a class of which, called Runge-Kutta¹ methods,
will be considered below) for which we can do better. Thus, in order to
quantify the asymptotic rate of decay of the truncation error as the step
size $h$ converges to 0, we introduce the following definition.


Definition 12.2 The numerical method (12.13) is said to have order
of accuracy $p$, if $p$ is the largest positive integer such that, for any
sufficiently smooth solution curve $(x, y(x))$ in $D$ of the initial value problem
(12.1), (12.2), there exist constants $K$ and $h_0$ such that

$$ |T_n| \le K h^p \quad \text{for } 0 < h \le h_0 $$

for any pair of points $(x_n, y(x_n))$, $(x_{n+1}, y(x_{n+1}))$ on the solution curve.

1 After Carle David Tolmé Runge (30 August 1856, Bremen, Germany – 3 January
1927, Göttingen, Germany) and Martin Wilhelm Kutta (3 November 1867,
Pitschen, Upper Silesia, Prussia, North Germany (now Byczyna, Poland) – 25
December 1944, Fürstenfeldbruck, Germany).
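The definition concerns the truncation error, but by (12.16) a bound $|T_n| \le K h^p$ carries over to the global error, and that is easy to observe numerically. The sketch below estimates $p$ from the errors at step sizes $h$ and $h/2$, assuming the error behaves like $K h^p$; the function names and the test problem $y' = y$, $y(0) = 1$ on $[0, 1]$ are our own illustrative choices.

```python
import math


def euler_error(n_steps):
    """Global error |y(1) - y_N| of Euler's method for y' = y, y(0) = 1 on [0, 1]."""
    h = 1.0 / n_steps
    y = 1.0
    for _ in range(n_steps):
        y += h * y                      # f(x, y) = y
    return abs(math.e - y)


def observed_order(error, n_steps):
    """Estimate p from errors at step sizes h and h/2, assuming error ~ K h^p."""
    return math.log2(error(n_steps) / error(2 * n_steps))


if __name__ == "__main__":
    for n_steps in (50, 100, 200):
        print(n_steps, observed_order(euler_error, n_steps))
```

For Euler's method the estimates approach 1 as the mesh is refined; the same `observed_order` helper can be applied to any routine that returns an error for a given number of steps.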


12.4 An implicit one-step method 
A one-step method with second-order accuracy is the trapezium rule
method

$$ y_{n+1} = y_n + \frac{h}{2}[f(x_n, y_n) + f(x_{n+1}, y_{n+1})] . \tag{12.26} $$

This method is easily motivated by writing

$$ y(x_{n+1}) - y(x_n) = \int_{x_n}^{x_{n+1}} y'(x)\,dx , $$

and approximating the integral by the trapezium rule. Since the right-hand
side involves the integral of the function $x \mapsto y'(x) = f(x, y(x))$
we see at once from (7.6) that the truncation error

$$ T_n = \frac{y(x_{n+1}) - y(x_n)}{h} - \frac{1}{2}[f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))] $$

of the trapezium rule method satisfies the bound

$$ |T_n| \le \tfrac{1}{12} h^2 M_3 , \quad \text{where } M_3 = \max_{x \in [x_0, X_M]} |y'''(x)| . \tag{12.27} $$


The important difference between this method and Euler’s method
is that the value $y_{n+1}$ appears on both sides of (12.26). To calculate
$y_{n+1}$ from the known $y_n$ therefore requires the solution of an equation,
which will usually be nonlinear. This additional complication means an
increase in the amount of computation required, but not usually a very
large increase. The equation (12.26) is easily solved for $y_{n+1}$ by Newton’s
method, assuming that the derivative $\partial f/\partial y$ can be calculated quickly;
as a starting point for the Newton iteration the obvious estimate

$$ y_n + h f(x_n, y_n) $$

will usually be close, and a couple of iterations will then suffice.
Methods of this type, which require the solution of an equation to
determine the new value $y_{n+1}$, are known as implicit methods.
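One way to organise the Newton solve is sketched below. It assumes that the caller supplies $\partial f/\partial y$ as a second function and that a fixed, small number of Newton iterations is enough; both are simplifying assumptions, and the names `trapezium_step` and `trapezium` are hypothetical.

```python
def trapezium_step(f, dfdy, x_n, y_n, h, newton_iters=3):
    """One step of the trapezium rule method (12.26), solved by Newton's method."""
    x_next = x_n + h
    f_n = f(x_n, y_n)
    y = y_n + h * f_n                      # starting guess: one Euler step
    for _ in range(newton_iters):
        # Solve G(y) = y - y_n - (h/2) [f(x_n, y_n) + f(x_{n+1}, y)] = 0.
        g_val = y - y_n - 0.5 * h * (f_n + f(x_next, y))
        g_prime = 1.0 - 0.5 * h * dfdy(x_next, y)
        y -= g_val / g_prime
    return y


def trapezium(f, dfdy, x0, y0, x_end, n_steps):
    """Apply n_steps trapezium rule steps on [x0, x_end]; returns the final value."""
    h = (x_end - x0) / n_steps
    y = y0
    for n in range(n_steps):
        y = trapezium_step(f, dfdy, x0 + n * h, y, h)
    return y


if __name__ == "__main__":
    # Illustrative problem (our choice): y' = -y^2, y(0) = 1, so y(1) = 1/2.
    f = lambda x, y: -y * y
    dfdy = lambda x, y: -2.0 * y
    for n_steps in (10, 20, 40):
        print(n_steps, abs(0.5 - trapezium(f, dfdy, 0.0, 1.0, 1.0, n_steps)))
```

When $f$ is linear in $y$ the equation for $y_{n+1}$ is linear and a single Newton iteration solves it exactly; for nonlinear $f$ a production code would normally iterate until a residual tolerance is met rather than use a fixed count.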
Writing the trapezium rule method in the standard form (12.13) we
see that

$$ h\Phi(x_n, y_n; h) = \frac{h}{2}[f(x_n, y_n) + f(x_{n+1}, y_{n+1})] = \frac{h}{2}[f(x_n, y_n) + f(x_{n+1}, y_n + h\Phi(x_n, y_n; h))] . \tag{12.28} $$

Hence, the function $\Phi$ is also defined in an implicit form.
In order to employ Theorem 12.2 to estimate the error in the trapezium
rule method we need a value for the Lipschitz constant $L_\Phi$. From
(12.28) we find that

$$ |\Phi(x_n, u; h) - \Phi(x_n, v; h)| = \tfrac{1}{2}\,|f(x_n, u) + f(x_n + h, u + h\Phi(x_n, u; h)) - f(x_n, v) - f(x_n + h, v + h\Phi(x_n, v; h))| . $$

Hence,

$$ \begin{aligned} |\Phi(x_n, u; h) - \Phi(x_n, v; h)| &\le \tfrac{1}{2}|f(x_n, u) - f(x_n, v)| + \tfrac{1}{2}|f(x_n + h, u + h\Phi(x_n, u; h)) - f(x_n + h, v + h\Phi(x_n, v; h))| \\ &\le \tfrac{1}{2} L_f |u - v| + \tfrac{1}{2} L_f |u + h\Phi(x_n, u; h) - v - h\Phi(x_n, v; h)| \\ &\le \tfrac{1}{2} L_f |u - v| + \tfrac{1}{2} L_f |u - v| + \tfrac{1}{2} L_f h \, |\Phi(x_n, u; h) - \Phi(x_n, v; h)| . \end{aligned} $$

This shows that

$$ (1 - \tfrac{1}{2} h L_f) \, |\Phi(x_n, u; h) - \Phi(x_n, v; h)| \le L_f |u - v| , $$

and, therefore,

$$ L_\Phi \le \frac{L_f}{1 - \frac{1}{2} h L_f} , \quad \text{provided that } \tfrac{1}{2} h L_f < 1 . $$

Consequently, (12.16) and (12.27) imply that the global error in the
trapezium rule method is $O(h^2)$, as $h$ tends to 0.

Figure 12.3 depicts the results of some numerical calculations on the
interval $x \in [0, 1.6]$ for the same problem as in Figure 12.2. The step
sizes are 0.4 and 0.2, larger than for Euler’s method; nevertheless we see
a much reduced error in comparison with Euler’s method, and also how
the reduction in the step size $h$ by a factor of 2 gives a reduction in the
error by a factor of about 4, as predicted by our error analysis.


12.5 Runge-Kutta methods 


Euler’s method is only first-order accurate; nevertheless, it is simple and
cheap to implement because, to obtain $y_{n+1}$ from $y_n$, we only require a
single evaluation of the function $f$, at $(x_n, y_n)$. Runge-Kutta methods
aim to achieve higher accuracy by sacrificing the efficiency of Euler’s
method through re-evaluating $f(\cdot,\cdot)$ at points intermediate between
$(x_n, y(x_n))$ and $(x_{n+1}, y(x_{n+1}))$. Consider, for example, the following
family of methods:

$$ y_{n+1} = y_n + h(a k_1 + b k_2) , \tag{12.29} $$

where

$$ k_1 = f(x_n, y_n) , \tag{12.30} $$

$$ k_2 = f(x_n + \alpha h, y_n + \beta h k_1) , \tag{12.31} $$

and where the parameters $a$, $b$, $\alpha$ and $\beta$ are to be determined.

Fig. 12.3. Trapezium rule method for the solution of (12.20). The exact
solution (solid curve) and two sets of results are shown (large and small dots),
using respectively 4 steps of size 0.4, and 8 steps of size 0.2 on $[0, 1.6]$.

Note that Euler’s method is a member of this family of methods,
corresponding to $a = 1$ and $b = 0$. However, we are now seeking methods
that are at least second-order accurate. Clearly (12.29)-(12.31) can be
written in the form (12.13) with

$$ \Phi(x_n, y_n; h) = a f(x_n, y_n) + b f(x_n + \alpha h, y_n + \beta h f(x_n, y_n)) . $$

By the condition (12.21), a method from this family will be consistent if,
and only if, $a + b = 1$. Further conditions on the parameters are found
by attempting to maximise the order of accuracy of the method.

To determine the truncation error of the method from (12.14) we need
the higher derivatives of $y(x)$, which are obtained by differentiating the
function $f$:

$$ \begin{aligned} y'(x_n) &= f , \\ y''(x_n) &= f_x + f_y y' = f_x + f_y f , \\ y'''(x_n) &= f_{xx} + f_{xy} f + f_{yx} f + f_{yy} f^2 + f_y (f_x + f_y f) , \end{aligned} $$

and so on; in these expressions the subscripts $x$ and $y$ denote partial
derivatives, and all functions appearing on the right-hand sides are to
be evaluated at $(x_n, y(x_n))$. We also need to expand $\Phi(x_n, y(x_n); h)$ in
powers of $h$, giving (with the same notational conventions as before)

$$ \Phi(x_n, y(x_n); h) = a f + b\bigl(f + \alpha h f_x + \beta h f f_y + \tfrac{1}{2}(\alpha h)^2 f_{xx} + \alpha\beta h^2 f f_{xy} + \tfrac{1}{2}(\beta h)^2 f^2 f_{yy} + O(h^3)\bigr) . $$

Thus, we obtain the truncation error in the form

$$ \begin{aligned} T_n &= \frac{y(x_n + h) - y(x_n)}{h} - \Phi(x_n, y(x_n); h) \\ &= f + \tfrac{1}{2} h (f_x + f f_y) + \tfrac{1}{6} h^2 [f_{xx} + 2 f_{xy} f + f_{yy} f^2 + f_y (f_x + f f_y)] \\ &\quad - \{a f + b[f + \alpha h f_x + \beta h f f_y + \tfrac{1}{2}(\alpha h)^2 f_{xx} + \alpha\beta h^2 f f_{xy} + \tfrac{1}{2}(\beta h)^2 f^2 f_{yy}]\} + O(h^3) . \end{aligned} $$

As $1 - a - b = 0$, the term $(1 - a - b) f$ is equal to 0. The coefficient of
the term in $h$ is

$$ \tfrac{1}{2}(f_x + f f_y) - b\alpha f_x - b\beta f f_y , $$

which vanishes for all functions $f$ provided that

$$ b\alpha = b\beta = \tfrac{1}{2} . $$

The method is therefore second-order accurate if

$$ \beta = \alpha , \quad a = 1 - \frac{1}{2\alpha} , \quad b = \frac{1}{2\alpha} , \quad \alpha \ne 0 , $$

showing that there is a one-parameter family of second-order methods of
this form, parametrised by $\alpha \ne 0$. The truncation error of the method
then becomes

$$ T_n = h^2 \left\{ \left(\tfrac{1}{6} - \tfrac{\alpha}{4}\right)(f_{xx} + f_{yy} f^2) + \left(\tfrac{1}{3} - \tfrac{\alpha}{2}\right) f f_{xy} + \tfrac{1}{6}(f_x f_y + f f_y^2) \right\} + O(h^3) . \tag{12.32} $$

Evidently there is no choice of the free parameter $\alpha$ which will make
this method third-order accurate for all functions $f$; this can be seen,
for example, by considering the initial value problem $y' = y$, $y(0) = 1$,
and noting that in this case (12.32), with $f(x,y) = y$, yields

$$ T_n = \tfrac{1}{6} h^2 y(x_n) + O(h^3) = \tfrac{1}{6} h^2 e^{x_n} + O(h^3) . $$

Two examples of second-order Runge-Kutta methods of the form
(12.29)-(12.31) are the modified Euler method and the improved Euler
method.

(a) The modified Euler method. In this case we take $\alpha = \frac{1}{2}$ to
obtain

$$ y_{n+1} = y_n + h f\left(x_n + \tfrac{1}{2} h, \ y_n + \tfrac{1}{2} h f(x_n, y_n)\right) . $$

(b) The improved Euler method. This is arrived at by choosing
$\alpha = 1$ which gives

$$ y_{n+1} = y_n + \tfrac{1}{2} h \left[ f(x_n, y_n) + f(x_n + h, \ y_n + h f(x_n, y_n)) \right] . $$

For these two methods it is easily verified using (12.32) that the
truncation error is of the form, respectively,

$$ T_n = \tfrac{1}{6} h^2 \left[ f_y (f_x + f f_y) + \tfrac{1}{4}(f_{xx} + 2 f_{xy} f + f_{yy} f^2) \right] + O(h^3) , $$

$$ T_n = \tfrac{1}{6} h^2 \left[ f_y (f_x + f f_y) - \tfrac{1}{2}(f_{xx} + 2 f_{xy} f + f_{yy} f^2) \right] + O(h^3) . $$
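The one-parameter family can be coded once and specialised by the value of $\alpha$. The sketch below takes $\beta = \alpha$, $b = 1/(2\alpha)$ and $a = 1 - b$ as derived above; the function name `rk2_final` and the test problem $y' = -y^2$, $y(0) = 1$ (with solution $1/(1+x)$) are our own illustrative choices.

```python
import math


def rk2_final(f, x0, y0, x_end, n_steps, alpha):
    """Two-stage method (12.29)-(12.31) with beta = alpha, b = 1/(2 alpha), a = 1 - b."""
    b = 1.0 / (2.0 * alpha)
    a = 1.0 - b
    h = (x_end - x0) / n_steps
    y = y0
    for n in range(n_steps):
        x = x0 + n * h
        k1 = f(x, y)
        k2 = f(x + alpha * h, y + alpha * h * k1)
        y += h * (a * k1 + b * k2)
    return y


if __name__ == "__main__":
    # Illustrative problem (our choice): y' = -y^2, y(0) = 1, so y(1) = 1/2.
    f = lambda x, y: -y * y
    for name, alpha in (("modified Euler", 0.5), ("improved Euler", 1.0)):
        errs = [abs(0.5 - rk2_final(f, 0.0, 1.0, 1.0, n, alpha)) for n in (50, 100)]
        print(name, errs, math.log2(errs[0] / errs[1]))
```

The last number printed on each line is an observed order, which should be close to 2 for both choices of $\alpha$; the sizes of the errors differ because the two methods have different leading terms in (12.32).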


A similar but more complicated analysis is used to construct Runge-Kutta
methods of higher order. One of the most frequently used methods
of the Runge-Kutta family is often known as the classical fourth-order
method:

$$ y_{n+1} = y_n + \tfrac{1}{6} h (k_1 + 2k_2 + 2k_3 + k_4) , $$

where

$$ \begin{aligned} k_1 &= f(x_n, y_n) , \\ k_2 &= f(x_n + \tfrac{1}{2} h, \ y_n + \tfrac{1}{2} h k_1) , \\ k_3 &= f(x_n + \tfrac{1}{2} h, \ y_n + \tfrac{1}{2} h k_2) , \\ k_4 &= f(x_n + h, \ y_n + h k_3) . \end{aligned} \tag{12.33} $$

Here $k_2$ and $k_3$ represent approximations to the derivative $y'$ at points on
the solution curve, intermediate between $(x_n, y(x_n))$ and $(x_{n+1}, y(x_{n+1}))$,
and $\Phi(x_n, y_n; h)$ is a weighted average of the $k_i$, $i = 1, 2, 3, 4$, the weights
corresponding to those of Simpson’s rule (to which the classical
fourth-order Runge-Kutta method reduces when $\partial f/\partial y \equiv 0$).

Fig. 12.4. The errors in three methods for the solution of (12.20) on the
interval $[0, 1.6]$. Reading from the top, the lines (whose slopes indicate first-,
second- and fourth-order convergence) represent the errors of Euler’s method,
the trapezium rule method, and the classical Runge-Kutta method
respectively. The horizontal axis indicates the number $N = 1.6/h$, on a logarithmic
scale, and the vertical axis shows $\ln |e_N| = \ln |y(1.6) - y_N|$.

To illustrate the behaviour of the one-step methods which we have
discussed, Figure 12.4 shows the errors in the calculation of $y(1.6)$, where
$y(x)$ is the solution to the problem (12.20) on the interval $[0, 1.6]$. The
horizontal axis indicates $N$, the number of equally spaced mesh points
used in the interval $[0, 1.6]$, on a logarithmic scale, and the vertical axis
shows $\ln |e_N| = \ln |y(1.6) - y_N|$. The three methods employed are Euler’s
method, the trapezium rule method, and the classical Runge-Kutta
method (12.33). The three lines show clearly the improved accuracy of
the higher-order methods, and the rate at which the accuracy improves
as $N$ increases.
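For reference, the classical fourth-order method (12.33) can be sketched as follows. The function name `rk4_final` and the test problem $y' = y$, $y(0) = 1$ are our own illustrative choices; the stage formulae are the standard ones for this method.

```python
import math


def rk4_final(f, x0, y0, x_end, n_steps):
    """Classical fourth-order Runge-Kutta method (12.33); returns the value at x_end."""
    h = (x_end - x0) / n_steps
    y = y0
    for n in range(n_steps):
        x = x0 + n * h
        k1 = f(x, y)
        k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
        k4 = f(x + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0   # Simpson-like weights
    return y


if __name__ == "__main__":
    # Illustrative problem (our choice): y' = y, y(0) = 1 on [0, 1].
    f = lambda x, y: y
    errs = [abs(math.e - rk4_final(f, 0.0, 1.0, 1.0, n)) for n in (20, 40)]
    print(errs, math.log2(errs[0] / errs[1]))
```

When $f$ does not depend on $y$, the stages $k_2$ and $k_3$ coincide and a single step is exactly Simpson's rule applied to $\int_{x_n}^{x_{n+1}} f(x)\,dx$, which is the reduction mentioned in the text.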


12.6 Linear multistep methods 


While Runge-Kutta methods give an improvement over Euler’s method
in terms of accuracy, this is achieved by investing additional
computational effort; in fact, Runge-Kutta methods require more evaluations
of $f(\cdot,\cdot)$ than would seem necessary. For example, the fourth-order
method involves four function evaluations per step. For comparison,
by considering three consecutive points $x_{n-1}$, $x_n = x_{n-1} + h$,
$x_{n+1} = x_{n-1} + 2h$, integrating the differential equation between $x_{n-1}$ and $x_{n+1}$
yields

$$ y(x_{n+1}) = y(x_{n-1}) + \int_{x_{n-1}}^{x_{n+1}} f(x, y(x))\,dx , $$

and applying Simpson’s rule to approximate the integral on the right-hand
side then leads to the method

$$ y_{n+1} = y_{n-1} + \tfrac{1}{3} h \left[ f(x_{n-1}, y_{n-1}) + 4 f(x_n, y_n) + f(x_{n+1}, y_{n+1}) \right] , \tag{12.34} $$

requiring only three function evaluations per step. In contrast with the
one-step methods considered in the previous section where only a single
value $y_n$ was required to compute the next approximation $y_{n+1}$, here we
need two preceding values, $y_n$ and $y_{n-1}$, to be able to calculate $y_{n+1}$,
and therefore (12.34) is not a one-step method.

In this section we consider a class of methods of the type (12.34) for
the numerical solution of the initial value problem (12.1), (12.2), called
linear multistep methods.

Given a sequence of equally spaced mesh points $(x_n)$ with step size $h$,
we consider the general linear $k$-step method

$$ \sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f(x_{n+j}, y_{n+j}) , \tag{12.35} $$

where the coefficients $\alpha_0, \ldots, \alpha_k$ and $\beta_0, \ldots, \beta_k$ are real constants. In
order to avoid degenerate cases, we shall assume that $\alpha_k \ne 0$ and that
$\alpha_0$ and $\beta_0$ are not both equal to 0. If $\beta_k = 0$, then $y_{n+k}$ is obtained
explicitly from previous values of $y_j$ and $f(x_j, y_j)$, and the $k$-step method
is then said to be explicit. On the other hand, if $\beta_k \ne 0$, then $y_{n+k}$
appears not only on the left-hand side but also on the right, within
$f(x_{n+k}, y_{n+k})$; due to this implicit dependence on $y_{n+k}$ the method is
then called implicit. The method (12.35) is called linear because it
involves only linear combinations of the $y_{n+j}$ and the $f(x_{n+j}, y_{n+j})$,
$j = 0, 1, \ldots, k$; for the sake of notational simplicity, henceforth we shall
often write $f_n$ instead of $f(x_n, y_n)$.


Example 12.4 We have already seen an example of a linear two-step
method in (12.34); here we present further examples of linear multistep
methods.

(a) Euler’s method is a trivial case: it is an explicit linear one-step
method. The implicit Euler method

$$ y_{n+1} = y_n + h f(x_{n+1}, y_{n+1}) \tag{12.36} $$

is an implicit linear one-step method. Another trivial example is the
trapezium rule method, given by

$$ y_{n+1} = y_n + \tfrac{1}{2} h (f_{n+1} + f_n) ; $$

it, too, is an implicit linear one-step method.

(b) The Adams¹-Bashforth² method

$$ y_{n+4} = y_{n+3} + \tfrac{1}{24} h (55 f_{n+3} - 59 f_{n+2} + 37 f_{n+1} - 9 f_n) $$

is an example of an explicit linear four-step method, while the
Adams-Moulton³ method

$$ y_{n+3} = y_{n+2} + \tfrac{1}{24} h (9 f_{n+3} + 19 f_{n+2} - 5 f_{n+1} + f_n) $$

is an implicit linear three-step method. $\diamond$
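As an illustration of how an explicit multistep method is used in practice, the following sketch implements the four-step Adams-Bashforth method of part (b). The three extra starting values are generated here by single classical Runge-Kutta steps, which is one reasonable choice rather than a prescribed one; the function names are hypothetical.

```python
import math


def rk4_step(f, x, y, h):
    """One classical Runge-Kutta step, used here only to generate starting values."""
    k1 = f(x, y)
    k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
    k4 = f(x + h, y + h * k3)
    return y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def adams_bashforth4(f, x0, y0, x_end, n_steps):
    """Four-step Adams-Bashforth method; returns the list [y_0, ..., y_N]."""
    h = (x_end - x0) / n_steps
    ys = [y0]
    for n in range(min(3, n_steps)):        # starting values y_1, y_2, y_3
        ys.append(rk4_step(f, x0 + n * h, ys[-1], h))
    fs = [f(x0 + n * h, ys[n]) for n in range(len(ys))]   # stored values f_n
    for n in range(n_steps - 3):
        # y_{n+4} = y_{n+3} + (h/24)(55 f_{n+3} - 59 f_{n+2} + 37 f_{n+1} - 9 f_n)
        y_new = ys[n + 3] + h * (55.0 * fs[n + 3] - 59.0 * fs[n + 2]
                                 + 37.0 * fs[n + 1] - 9.0 * fs[n]) / 24.0
        ys.append(y_new)
        fs.append(f(x0 + (n + 4) * h, y_new))   # one new evaluation of f per step
    return ys


if __name__ == "__main__":
    # Illustrative problem (our choice): y' = y, y(0) = 1 on [0, 1].
    for n_steps in (20, 40, 80):
        print(n_steps, abs(math.e - adams_bashforth4(lambda x, y: y, 0.0, 1.0, 1.0, n_steps)[-1]))
```

Because earlier values of $f$ are stored and reused, each step after the start-up phase costs a single new evaluation of $f$.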


There are systematic ways of generating linear multistep methods, 
but these constructions will not be discussed here. Instead, we turn our 
attention to the analysis of linear multistep methods and introduce the 
concepts of (zero-) stability, consistency and convergence. The signifi- 
cance of these properties cannot be overemphasised: the failure of any 
of the three will render the linear multistep method practically useless. 


12.7 Zero-stability 


As is clear from (12.35) we need $k$ starting values, $y_0, \ldots, y_{k-1}$, before
we can apply a linear $k$-step method to the initial value problem (12.1),
(12.2): of these, $y_0$ is given by the initial condition (12.2), but the others,
$y_1, \ldots, y_{k-1}$, have to be computed by other means: say, by using a
suitable one-step method (e.g. a Runge-Kutta method). At any rate,
the starting values will contain numerical errors and it is important to
know how these will affect further approximations $y_n$, $n \ge k$, which are
calculated by means of (12.35). Thus, we wish to consider the ‘stability’
of the numerical method with respect to ‘small perturbations’ in the
starting conditions.

1 John Couch Adams (5 June 1819, Laneast, Cornwall, England – 21 January 1892,
Cambridge, Cambridgeshire, England) was educated at St John’s College in
Cambridge. In 1841 while he was still an undergraduate, he began to study the
irregularities of the motion of Uranus to discover whether these can be attributed to the
action of an undiscovered planet. Four years later he gave accurate information
about the position of the new planet (Neptune) to the director of the Cambridge
Observatory. Adams made several other contributions to astronomy.

2 F. Bashforth: An Attempt to Test the Theories of Capillary Action by Comparing
the Theoretical and Measured Forms of Drops of Fluid. With an Explanation
of the Method of Integration in Constructing Tables Which Give the Theoretical
Form of Such Drops, by J.C. Adams, Cambridge University Press, 1883.

3 F.R. Moulton: New Methods in Exterior Ballistics, University of Chicago Press,
1926.


Definition 12.3 A linear $k$-step method (for the ordinary differential
equation $y' = f(x, y)$) is said to be zero-stable if there exists a constant
$K$ such that, for any two sequences $(y_n)$ and $(z_n)$ that have been
generated by the same formulae but different starting values $y_0, y_1, \ldots, y_{k-1}$
and $z_0, z_1, \ldots, z_{k-1}$, respectively, we have

$$ |y_n - z_n| \le K \max\{|y_0 - z_0|, |y_1 - z_1|, \ldots, |y_{k-1} - z_{k-1}|\} \tag{12.37} $$

for $x_n \le X_M$, and as $h$ tends to 0.


We shall prove later on that whether or not a method is zero-stable 
can be determined by merely considering its behaviour when applied 
to the trivial differential equation y’ = 0, corresponding to (12.1) with 
f(x,y) = 0; it is for this reason that the concept of stability formulated 
in Definition 12.3 is referred to as zero-stability. While Definition 12.3 
is expressive in the sense that it conforms with the intuitive notion of 
stability whereby ‘small perturbations at input give rise to small per- 
turbations at output’, it would be a very tedious exercise to verify the 
zero-stability of a linear multistep method using Definition 12.3 alone. 
Thus, we shall next formulate an algebraic equivalent of zero-stability, 
known as the Root Condition, which will simplify this task. Before 
doing so, however, we introduce some notation. 

Given the linear $k$-step method (12.35) we consider its first and
second characteristic polynomials, respectively

$$ \rho(z) = \sum_{j=0}^{k} \alpha_j z^j , $$

$$ \sigma(z) = \sum_{j=0}^{k} \beta_j z^j , $$

where, as before, we assume that

$$ \alpha_k \ne 0 , \quad \alpha_0^2 + \beta_0^2 \ne 0 . $$

Before stating the main theorem of this section, we recall a classical
result from the theory of $k$th-order linear recurrence relations.


Lemma 12.1 Consider the $k$th-order homogeneous linear recurrence
relation

$$ \alpha_k y_{n+k} + \cdots + \alpha_1 y_{n+1} + \alpha_0 y_n = 0 , \quad n = 0, 1, 2, \ldots, \tag{12.38} $$

with $\alpha_k \ne 0$, $\alpha_0 \ne 0$, $\alpha_j \in \mathbb{R}$, $j = 0, 1, \ldots, k$, and the corresponding
characteristic polynomial

$$ \rho(z) = \alpha_k z^k + \cdots + \alpha_1 z + \alpha_0 . $$

Let $z_r$, $1 \le r \le \ell$, $\ell \le k$, be the distinct roots of the polynomial $\rho$, and
let $m_r \ge 1$ denote the multiplicity of $z_r$, with $m_1 + \cdots + m_\ell = k$. If a
sequence $(y_n)$ of complex numbers satisfies (12.38), then

$$ y_n = \sum_{r=1}^{\ell} p_r(n) z_r^n , \quad \text{for all } n \ge 0, \tag{12.39} $$

where $p_r(\cdot)$ is a polynomial in $n$ of degree $m_r - 1$, $1 \le r \le \ell$. In
particular, if all roots are simple, that is $m_r = 1$, $1 \le r \le k$, then the
$p_r$, $r = 1, \ldots, k$, are constants.


Proof We give a sketch of the proof. Let us first consider the case when
all of the (distinct) roots $z_1, z_2, \ldots, z_k$ are simple. As, by assumption,
$\alpha_0 \ne 0$, none of the roots is equal to 0. It is then easy to verify by direct
substitution that, since $\rho(z_r) = 0$, $r = 1, 2, \ldots, k$, each of the sequences
$(y_n) = (z_r^n)$, $r = 1, 2, \ldots, k$, satisfies (12.38).

In order to prove that any solution $(y_n)$ of (12.38) can be expressed
as a linear combination of the sequences $(z_1^n), (z_2^n), \ldots, (z_k^n)$, it suffices
to show that these $k$ sequences are linearly independent. To do so, let
us suppose that

$$ C_1 z_1^n + C_2 z_2^n + \cdots + C_k z_k^n = 0 , \quad \text{for all } n = 0, 1, 2, \ldots. $$

Then, in particular,

$$ \begin{aligned} C_1 + C_2 + \cdots + C_k &= 0 , \\ C_1 z_1 + C_2 z_2 + \cdots + C_k z_k &= 0 , \\ &\ \ \vdots \\ C_1 z_1^{k-1} + C_2 z_2^{k-1} + \cdots + C_k z_k^{k-1} &= 0 . \end{aligned} $$

The matrix of this system of $k$ simultaneous linear equations for the $k$
unknowns $C_1, C_2, \ldots, C_k$ has the determinant

$$ D = \begin{vmatrix} 1 & 1 & \cdots & 1 \\ z_1 & z_2 & \cdots & z_k \\ \vdots & \vdots & & \vdots \\ z_1^{k-1} & z_2^{k-1} & \cdots & z_k^{k-1} \end{vmatrix} , $$

known as the Vandermonde determinant, and $D = \prod_{r<s} (z_s - z_r)$. Since
the roots are distinct, $D \ne 0$, so the matrix of the system is nonsingular.
Therefore $C_1 = C_2 = \cdots = C_k = 0$ is the unique solution, which then
means that the sequences $(z_1^n), (z_2^n), \ldots, (z_k^n)$ are linearly independent.

Now, suppose that $(y_n)$ is any solution of (12.38); as $D \ne 0$, there
exists a unique set of $k$ constants, $C_1, C_2, \ldots, C_k$, such that

$$ y_m = C_1 z_1^m + C_2 z_2^m + \cdots + C_k z_k^m , \quad m = 0, 1, \ldots, k-1. \tag{12.40} $$

Substituting these equalities into (12.38) for $n = 0$, we conclude that

$$ \begin{aligned} 0 &= \alpha_k y_k + \alpha_{k-1}(C_1 z_1^{k-1} + \cdots + C_k z_k^{k-1}) + \cdots + \alpha_0 (C_1 z_1^0 + \cdots + C_k z_k^0) \\ &= \alpha_k y_k + C_1(\rho(z_1) - \alpha_k z_1^k) + \cdots + C_k(\rho(z_k) - \alpha_k z_k^k) \\ &= \alpha_k \bigl(y_k - (C_1 z_1^k + \cdots + C_k z_k^k)\bigr) . \end{aligned} $$

As $\alpha_k \ne 0$, it follows that

$$ y_k = C_1 z_1^k + \cdots + C_k z_k^k , $$

which, together with (12.40), proves (12.39) for $0 \le n \le k$ in the case of
simple roots. Next, we select $n = 1$ in (12.38) and proceed in the same
manner as in the case of $n = 0$ discussed above to show that (12.39)
holds for $0 \le n \le k+1$. Continuing in the same way, we deduce by
induction that (12.39) holds for all $n \ge 0$.

In the case when $\rho(z)$ has repeated roots, the proof is similar, except
that instead of $(z_r^n)$, $r = 1, 2, \ldots, k$, the following $k$ sequences are used:

$$ (z_r^n) , \quad (n z_r^n) , \quad \ldots , \quad (n^{m_r - 1} z_r^n) . \tag{12.41} $$

These can be shown to satisfy (12.38) by direct substitution on noting
that $\rho(z_r) = \rho'(z_r) = \cdots = \rho^{(m_r - 1)}(z_r) = 0$, given that $z_r$ is a root of
$\rho(z)$ of multiplicity $m_r$, $r = 1, 2, \ldots, \ell$. The linear independence of the
sequences (12.41) follows as before, except instead of $\prod_{r<s} (z_s - z_r)$, the
value of the corresponding determinant is now

$$ D_1 = \prod_{1 \le r < s \le \ell} (z_s - z_r)^{m_r + m_s} \prod_{r=1}^{\ell} (m_r - 1)!! $$

where $0!! = 1$, $m!! = m!\,(m-1)! \cdots 1!$ for $m = 1, 2, \ldots$. As the roots
$z_1, z_2, \ldots, z_\ell$ are distinct, we have that $D_1 \ne 0$, and therefore the
sequences (12.41) are linearly independent. The rest of the argument is
identical as in the case of simple roots.

1 For details, see, for example, pp. 213-214 of P. Henrici, Discrete Variable Methods
in Ordinary Differential Equations, Wiley, New York, 1962.
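A small worked instance of (12.39), for a recurrence of our own choosing rather than one taken from the text, shows how a repeated root produces polynomial growth in $n$:

```latex
% Illustrative recurrence (our choice): y_{n+2} - 2 y_{n+1} + y_n = 0.
\begin{align*}
\rho(z) &= z^2 - 2z + 1 = (z - 1)^2, \qquad z_1 = 1, \quad m_1 = 2, \\
y_n &= p_1(n)\, z_1^n = C_1 + C_2 n , \\
y_0 = 0,\ y_1 = \varepsilon \quad &\Longrightarrow \quad C_1 = 0,\ C_2 = \varepsilon, \qquad y_n = n\varepsilon .
\end{align*}
% Check by substitution: (n+2)\varepsilon - 2(n+1)\varepsilon + n\varepsilon = 0.
```

However small the perturbation $\varepsilon$ in the second starting value, the solution $y_n = n\varepsilon$ is unbounded as $n \to \infty$.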


Now, we are ready to state the main result of this section. 


Theorem 12.4 (Root Condition) A linear multistep method is zero- 
stable for any initial value problem of the form (12.1), (12.2), where f 
satisfies the hypotheses of Picard’s Theorem, if, and only if, all roots 
of the first characteristic polynomial of the method are inside the closed 
unit disc in the complex plane, with any which lie on the unit circle being 
simple. 


The algebraic stability condition contained in this theorem, namely 
that the roots of the first characteristic polynomial lie in the closed unit 
disc and those on the unit circle are simple, is often called the Root 
Condition. 


Proof of theorem Necessity. Consider the method (12.35), applied to
$y' = 0$:

$$ \alpha_k y_{n+k} + \cdots + \alpha_1 y_{n+1} + \alpha_0 y_n = 0 . \tag{12.42} $$

According to Lemma 12.1, every solution of this $k$th-order linear
recurrence relation has the form

$$ y_n = \sum_{r=1}^{\ell} p_r(n) z_r^n , \tag{12.43} $$

where $z_r$ is a root, of multiplicity $m_r \ge 1$, of the first characteristic
polynomial $\rho$ of the method, and the polynomial $p_r$ has degree $m_r - 1$,
$1 \le r \le \ell$, $\ell \le k$. Clearly, if $|z_r| > 1$ for some $r$, then there are starting
values $y_0, y_1, \ldots, y_{k-1}$ for which the corresponding solution grows like
$|z_r|^n$, and if $|z_r| = 1$ and the multiplicity is $m_r > 1$, then there is a
solution growing like $n^{m_r - 1}$. In either case there are solutions that grow
unboundedly as $n \to \infty$, i.e., as $h \to 0$ with $nh$ fixed. Considering
starting values $y_0, y_1, \ldots, y_{k-1}$ which give rise to such an unbounded
solution $(y_n)$, and starting values $z_0 = z_1 = \cdots = z_{k-1} = 0$ for which
the corresponding solution of (12.42) is $(z_n)$ with $z_n = 0$ for all $n$, we
see that (12.37) cannot hold. To summarise, if the Root Condition is
violated, then the method is not zero-stable.

1 We warn the reader that in certain mathematical texts the notation $m!!$ is, instead,
used to mean $m \cdot (m-2) \cdots 5 \cdot 3 \cdot 1$ for $m$ odd and $m \cdot (m-2) \cdots 6 \cdot 4 \cdot 2$ for $m$
even.

Sufficiency. The proof that the Root Condition is sufficient for zero- 
stability is long and technical, and will be omitted here. For details, the 
interested reader is referred to Theorem 3.1 on page 353 of W. Gautschi, 
Numerical Analysis: an Introduction, Birkhäuser, Boston, MA, 1997.


Example 12.5 We shall explore the zero-stability of the methods from 
Example 12.4 using the Root Condition. 


(a) The Euler method and the implicit Euler method have first
characteristic polynomial $\rho(z) = z - 1$ with simple root $z = 1$, so
both methods are zero-stable. The same is true of the trapezium rule
method.

(b) The Adams–Bashforth and Adams–Moulton methods considered in
Example 12.4 have first characteristic polynomials, respectively,
$\rho(z) = z^3(z - 1)$ and $\rho(z) = z^2(z - 1)$. These have multiple
root $z = 0$ and simple root $z = 1$, and therefore both methods are
zero-stable.

(c) The three-step method

\[ 11 y_{n+3} + 27 y_{n+2} - 27 y_{n+1} - 11 y_n = 3h \left( f_{n+3} + 9 f_{n+2} + 9 f_{n+1} + f_n \right) \tag{12.44} \]

is not zero-stable. Indeed, the corresponding first characteristic
polynomial $\rho(z) = 11z^3 + 27z^2 - 27z - 11$ has roots at $z_1 = 1$,
$z_2 \approx -0.32$, $z_3 \approx -3.14$, so $|z_3| > 1$.

(d) The first characteristic polynomial of the three-step method

\[ y_{n+3} + y_{n+2} - y_{n+1} - y_n = 2h \left( f_{n+2} + f_{n+1} \right) \]

is $\rho(z) = z^3 + z^2 - z - 1 = (z + 1)(z^2 - 1)$, which has roots
$z_{1/2} = -1$, $z_3 = 1$. The first of these is a double root lying on
the unit circle; therefore, the method is not zero-stable. ◊
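The checks in Example 12.5 can also be carried out numerically. The
following is a minimal sketch, assuming NumPy is available; the function
name `satisfies_root_condition` and both tolerances are illustrative
choices that do not come from the text. Since a multiple root is computed
only to reduced accuracy in floating-point arithmetic, two roots on the
unit circle are treated as a repeated root when they lie within `sep` of
one another.

```python
import numpy as np

def satisfies_root_condition(rho, tol=1e-6, sep=1e-3):
    """Numerical test of the Root Condition.

    rho: coefficients of the first characteristic polynomial, highest
         power first (the numpy.roots convention).
    tol, sep: illustrative tolerances (assumptions of this sketch).
    """
    roots = np.roots(rho)
    moduli = np.abs(roots)
    if np.any(moduli > 1 + tol):           # a root outside the closed unit disc
        return False
    on_circle = roots[moduli >= 1 - tol]   # roots (numerically) on the unit circle
    for i in range(len(on_circle)):
        for j in range(i + 1, len(on_circle)):
            if abs(on_circle[i] - on_circle[j]) < sep:
                return False               # repeated root on the unit circle
    return True

print(satisfies_root_condition([1, -1]))             # rho(z) = z - 1
print(satisfies_root_condition([1, -1, 0, 0, 0]))    # rho(z) = z^3 (z - 1)
print(satisfies_root_condition([11, 27, -27, -11]))  # method (12.44)
print(satisfies_root_condition([1, 1, -1, -1]))      # method of part (d)
```

A tolerance-based test of this kind is only a heuristic: roots very close
to, but not on, the unit circle may be misclassified, so borderline cases
are better settled by hand as in the example.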
12.8 Consistency


In this section we consider the accuracy of the linear $k$-step method
(12.35). For this purpose, as in the case of one-step methods, we
introduce the notion of truncation error. Thus, suppose that $y$ is a
solution to the ordinary differential equation (12.1). The truncation
error of (12.35) is then defined as follows:

\[ T_n = \frac{\sum_{j=0}^{k} \left[ \alpha_j y(x_{n+j}) - h \beta_j f(x_{n+j}, y(x_{n+j})) \right]}{h \sum_{j=0}^{k} \beta_j} . \tag{12.45} \]

Of course, the definition requires implicitly that
$\sigma(1) = \sum_{j=0}^{k} \beta_j \neq 0$. Again, as in the case of
one-step methods, the truncation error can be thought of as the residual
that is obtained by inserting the solution of the differential equation
into the formula (12.35) and scaling this residual appropriately (in
this case dividing through by $h \sum_{j=0}^{k} \beta_j$), so that $T_n$
resembles $y' - f(x, y(x))$.


Definition 12.4 The numerical method (12.35) is said to be consistent
with the differential equation (12.1) if the truncation error defined by
(12.45) is such that for any $\varepsilon > 0$ there exists an
$h(\varepsilon)$ for which

\[ |T_n| < \varepsilon \quad \text{for } 0 < h < h(\varepsilon) , \]

and any $k + 1$ points $(x_n, y(x_n)), \ldots, (x_{n+k}, y(x_{n+k}))$ on
any solution curve in $D$ of the initial value problem (12.1), (12.2).


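Now, let us suppose that the solution to the differential equation
is sufficiently smooth, and let us expand the expressions $y(x_{n+j})$
and $f(x_{n+j}, y(x_{n+j})) = y'(x_{n+j})$ into Taylor series about the
point $x_n$. On substituting these expansions into the numerator in
(12.45) we obtain

\[ T_n = \frac{1}{h\sigma(1)} \left[ C_0 y(x_n) + C_1 h y'(x_n) + C_2 h^2 y''(x_n) + \cdots \right] , \tag{12.46} \]

where

\[
\begin{aligned}
C_0 &= \sum_{j=0}^{k} \alpha_j , \\
C_1 &= \sum_{j=1}^{k} j \alpha_j - \sum_{j=0}^{k} \beta_j , \\
C_q &= \sum_{j=1}^{k} \frac{j^q}{q!} \alpha_j - \sum_{j=1}^{k} \frac{j^{q-1}}{(q-1)!} \beta_j , \qquad q = 2, 3, \ldots .
\end{aligned}
\tag{12.47}
\]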
For consistency we need that, as $h \to 0$ and $n \to \infty$ with
$x_n \to x \in [x_0, X_M]$, the truncation error $T_n$ tends to 0. This
requires that $C_0 = 0$ and $C_1 = 0$ in (12.46). In terms of the
characteristic polynomials this consistency requirement can be restated
in compact form as

\[ \rho(1) = 0 \quad \text{and} \quad \rho'(1) = \sigma(1) \ (\neq 0) . \]


Let us observe that, according to this condition, if a linear multistep 
method is consistent, then it has a simple root on the unit circle at 
z = 1; thus, the Root Condition is not violated by this root. 


Definition 12.5 The numerical method (12.35) is said to have order
of accuracy $p$, if $p$ is the largest positive integer such that, for
any sufficiently smooth solution curve in $D$ of the initial value
problem (12.1), (12.2), there exist constants $K$ and $h_0$ such that

\[ |T_n| \leq K h^p \quad \text{for } 0 < h \leq h_0 , \]

for any $k + 1$ points $(x_n, y(x_n)), \ldots, (x_{n+k}, y(x_{n+k}))$ on
the solution curve.


Thus, we deduce from (12.46) that the method is of order of accuracy
$p$ if, and only if,

\[ C_0 = C_1 = \cdots = C_p = 0 \quad \text{and} \quad C_{p+1} \neq 0 . \]

In this case,

\[ T_n = \frac{C_{p+1}}{\sigma(1)}\, h^p y^{(p+1)}(x_n) + O(h^{p+1}) . \]

The number $C_{p+1}/\sigma(1)$ is called the error constant of the method.
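The order conditions lend themselves to a direct computation. The sketch
below evaluates the coefficients $C_q$ of (12.47) in exact rational
arithmetic and returns the order and the error constant; the function
names `C` and `order_and_error_constant`, and the cut-off `q_max`, are
assumptions introduced here for illustration only.

```python
from fractions import Fraction
from math import factorial

def C(q, alpha, beta):
    """Coefficient C_q of (12.47); alpha[j], beta[j] hold alpha_j, beta_j, j = 0..k."""
    if q == 0:
        return sum(alpha, Fraction(0))
    first = sum(Fraction(j**q, factorial(q)) * a for j, a in enumerate(alpha))
    second = sum(Fraction(j**(q - 1), factorial(q - 1)) * b for j, b in enumerate(beta))
    return first - second

def order_and_error_constant(alpha, beta, q_max=20):
    """Return (p, C_{p+1}/sigma(1)), where C_{p+1} is the first nonzero C_q.

    A returned p below 1 means that the method is not consistent.
    """
    sigma1 = sum(beta, Fraction(0))
    for q in range(q_max + 1):
        c = C(q, alpha, beta)
        if c != 0:
            return q - 1, c / sigma1
    raise ValueError("all C_q vanish up to q_max")

half = Fraction(1, 2)
# Trapezium rule: y_{n+1} - y_n = (h/2)(f_{n+1} + f_n); order 2, error constant -1/12
print(order_and_error_constant([-1, 1], [half, half]))
# Method (12.44), with beta_j = 3 * (1, 9, 9, 1); order 6
print(order_and_error_constant([-11, -27, 27, 11], [3, 27, 27, 3])[0])
```

Exact fractions are used so that the test `c != 0` is not affected by
rounding error.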


Example 12.6 Let us determine all values of the real parameter $b$,
$b \neq 0$, for which the linear multistep method

\[ y_{n+3} + (2b - 3)(y_{n+2} - y_{n+1}) - y_n = hb \left( f_{n+2} + f_{n+1} \right) \]

is zero-stable. We shall show that there exists a value of $b$ for which
the order of the method is 4, and that if the method is zero-stable for
some value of $b$, then its order cannot exceed 2.
According to the Root Condition, this linear multistep method is
zero-stable if, and only if, all roots of its first characteristic
polynomial

\[ \rho(z) = z^3 + (2b - 3)(z^2 - z) - 1 \]

belong to the closed unit disc, and those on the unit circle are simple.
Clearly, $\rho(1) = 0$; upon dividing $\rho(z)$ by $z - 1$ we see that
$\rho(z)$ can be written in the following factorised form:

\[ \rho(z) = (z - 1)\rho_1(z) , \quad \text{where} \quad \rho_1(z) = z^2 - 2(1 - b)z + 1 . \]


Thus, the method is zero-stable if, and only if, all roots of the
polynomial $\rho_1(z)$ belong to the closed unit disc, and those on the
unit circle are simple and differ from 1. Suppose that the method is
zero-stable. It then follows that $b \neq 0$ and $b \neq 2$, since these
values of $b$ correspond to double roots of $\rho_1(z)$ on the unit
circle, respectively, $z = 1$ and $z = -1$. Further, since the product of
the two roots of $\rho_1(z)$ is equal to 1, both have modulus less than
or equal to 1, and neither of them is equal to $\pm 1$, it follows that
they must both be strictly complex; hence the discriminant of the
quadratic polynomial $\rho_1(z)$ must be negative. That is,
$4(1 - b)^2 - 4 < 0$. In other words, $b \in (0, 2)$.

Conversely, suppose that $b \in (0, 2)$. Then, the roots of $\rho(z)$ are

\[ z_1 = 1 , \qquad z_{2/3} = 1 - b \pm \mathrm{i}\sqrt{1 - (b - 1)^2} . \]

Since $|z_{2/3}| = 1$, $z_{2/3} \neq 1$ and $z_2 \neq z_3$, all roots of
$\rho(z)$ lie on the unit circle and they are simple. Hence the method is
zero-stable. To summarise, the method is zero-stable if, and only if,
$b \in (0, 2)$.

In order to analyse the order of accuracy of the method, we note that,
upon Taylor series expansion, its truncation error can be written in the
form

\[ T_n = \frac{1}{\sigma(1)} \left[ \left( 1 - \frac{b}{6} \right) h^2 y'''(x_n) + \frac{1}{4}(6 - b)\, h^3 y^{(4)}(x_n) + \frac{1}{120}(150 - 23b)\, h^4 y^{(5)}(x_n) + O(h^5) \right] , \]

where $\sigma(1) = 2b \neq 0$. If $b = 6$, then $T_n = O(h^4)$ and so the
method is of order 4. As $b = 6$ does not belong to the interval
$(0, 2)$, we deduce that the method is not zero-stable for $b = 6$.

Since zero-stability requires $b \in (0, 2)$, in which case
$1 - b/6 \neq 0$, it follows that if the method is zero-stable, then
$T_n = O(h^2)$. ◊
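The conclusion of Example 12.6 about zero-stability can be spot-checked
for individual values of $b$. The sketch below works with the quadratic
factor $z^2 - 2(1 - b)z + 1$ of the first characteristic polynomial and
uses only the standard library; the names `rho1_roots` and
`is_zero_stable` and the tolerance `tol` are hypothetical, chosen for
this illustration.

```python
import cmath

def rho1_roots(b):
    """Roots of the quadratic factor z^2 - 2(1 - b) z + 1."""
    d = cmath.sqrt((1 - b) ** 2 - 1)
    return (1 - b) + d, (1 - b) - d

def is_zero_stable(b, tol=1e-12):
    """Root Condition for the method of Example 12.6 (tol is an assumption)."""
    z2, z3 = rho1_roots(b)
    in_disc = max(abs(z2), abs(z3)) <= 1 + tol
    # The roots must be distinct and must differ from the root z = 1 of z - 1.
    simple = min(abs(z2 - z3), abs(z2 - 1), abs(z3 - 1)) > tol
    return in_disc and simple

for b in (-1, 0, 0.5, 1, 1.5, 2, 6):
    print(b, is_zero_stable(b))
```

For values of $b$ extremely close to 0 or 2 the fixed tolerance may give
the wrong answer; the analytical argument of the example is what settles
those cases.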
12.9 Dahlquist’s theorems


An important result connecting the concepts of zero-stability, consis- 
tency and convergence of a linear multistep method was proved by the 
Swedish mathematician Germund Dahlquist. 


Theorem 12.5 (Dahlquist’s Equivalence Theorem) For a linear
$k$-step method that is consistent with the ordinary differential
equation (12.1) where $f$ is assumed to satisfy a Lipschitz condition,
and with consistent starting values,¹ zero-stability is necessary and
sufficient for convergence. Moreover if the solution $y$ has continuous
derivative of order $p + 1$ and truncation error $O(h^p)$, then the
global error of the method, $e_n = y(x_n) - y_n$, is also $O(h^p)$.


The proof of this result is long and technical; for details of the
argument, see Theorem 6.3.4 on page 357 of W. Gautschi, Numerical
Analysis: an Introduction, Birkhäuser, Boston, MA, 1997, or Theorem 5.10
on page 244 of P. Henrici, Discrete Variable Methods in Ordinary
Differential Equations, Wiley, New York, 1962.

By virtue of Dahlquist’s theorem, if a linear multistep method is not
zero-stable its global error cannot be made arbitrarily small by taking
the mesh size $h$ sufficiently small for any sufficiently accurate
initial data. In fact, if the Root Condition is violated, then there
exists a solution to the linear multistep method which will grow by an
arbitrarily large factor in a fixed interval of $x$, however accurate the
starting conditions are. This result highlights the importance of the
concept of zero-stability and indicates its relevance in practical
computations.

A second theorem by Dahlquist imposes a restriction on the order of 
accuracy of a zero-stable linear multistep method. 


Theorem 12.6 (Dahlquist’s Barrier Theorem) The order of accuracy of a
zero-stable $k$-step method cannot exceed $k + 1$ if $k$ is odd, or
$k + 2$ if $k$ is even.


A proof of this result will be found in Section 4.2 of Gautschi’s book 
or in Section 5.2-8 of Henrici’s book, cited above. 

Theorem 12.6 makes it very difficult to choose a ‘best’ multistep
method of a given order. Suppose, for example, that we consider five-step
methods. The general five-step method involves 12 parameters, of which 11
are independent: the method is obviously unaffected by multiplying all
the parameters by a nonzero constant. Now it would be possible to
construct a five-step method of order 10, by solving the 11 equations of
the form $C_q = 0$, $q = 0, 1, \ldots, 10$, where $C_q$ is given in
(12.47). But the Barrier Theorem states that this method would not be
zero-stable, and the order of a zero-stable five-step method cannot
exceed 6. There is a family of stable five-step methods of order 6,
involving 4 free parameters, and there is no obvious way of deciding
whether any one of these methods is better than the others.

¹ That is, with starting values $y_j = \eta_j = \eta_j(h)$,
$j = 0, \ldots, k - 1$, which all converge to the exact initial value
$y_0$ as $h \to 0$.


Example 12.7 (i) The Barrier Theorem says that when k = 1 the order 
of accuracy of a zero-stable method cannot exceed 2. The trapezium rule 
method has order 2, and is zero-stable. 

(ii) The two-step method

\[ y_{n+2} - y_n = \tfrac{1}{3} h \left( f_{n+2} + 4 f_{n+1} + f_n \right) \]

is zero-stable, as the roots of the first characteristic polynomial,
$\rho(z) = z^2 - 1$, are 1 and $-1$. A simple calculation shows that its
order of accuracy is 4; by the Barrier Theorem, this is the highest order
which could be achieved by a two-step method.

(iii) The three-step method

\[ 11 y_{n+3} + 27 y_{n+2} - 27 y_{n+1} - 11 y_n = 3h \left( f_{n+3} + 9 f_{n+2} + 9 f_{n+1} + f_n \right) \]

has order 6. The Barrier Theorem therefore implies that this method is
not zero-stable. We have already shown this in Example 12.5(c) using
the Root Condition.


It is found that all the zero-stable $k$-step methods of highest possible
order are implicit, with $\beta_k$ nonzero.


12.10 Systems of equations 


In this section we discuss the application of numerical methods to
simultaneous systems of differential equations, which we shall write in
the form

\[ \frac{\mathrm{d}\mathbf{y}}{\mathrm{d}x} = \mathbf{f}(x, \mathbf{y}) . \]

Here $\mathbf{y}$ is an $m$-component vector function of $x$, and
$\mathbf{f}$ is an $m$-component vector function of the independent
variable $x$ and the vector variable $\mathbf{y}$. In component form the
system becomes

\[ \frac{\mathrm{d}y_j}{\mathrm{d}x} = f_j(x, y_1, \ldots, y_m) , \qquad j = 1, 2, \ldots, m . \]

The system comprises $m$ simultaneous differential equations. To single
out a unique solution we need $m$ side conditions, and we shall suppose
that all these conditions are given at the same value of $x$, and have
the form

\[ \mathbf{y}(x_0) = \mathbf{y}_0 , \]

or, in component form,

\[ y_j(x_0) = y_{j,0} , \qquad j = 1, 2, \ldots, m , \]

where the values of $y_{j,0}$ are given. This is called an initial value
problem for a system of ordinary differential equations; we may also
require a solution of the system on an interval $[a, b]$, with $r$
conditions given at one end of the interval and $m - r$ conditions at the
other end. This constitutes a boundary value problem, and requires
different numerical methods which are considered in the next chapter.

All the numerical methods which we have discussed apply without
change to systems of differential equations; it is only necessary to
realise that we are dealing with vectors. For example, the first stage
of the classical Runge-Kutta method (12.33) becomes

\[ \mathbf{k}_1 = \mathbf{f}(x_n, \mathbf{y}_n) ; \]

we must evaluate all the elements of the vector $\mathbf{k}_1$ before
proceeding to the next stage to calculate $\mathbf{k}_2$, and so on.
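The remark about evaluating each stage vector in full can be made
concrete with a short sketch of one step of the classical fourth-order
Runge-Kutta method for a system, written with plain Python lists. The
helper names `add_scaled` and `rk4_step` and the test problem
($y_1' = y_2$, $y_2' = -y_1$, with solution $y_1 = \cos x$) are
assumptions chosen for illustration.

```python
import math

def add_scaled(y, h, k):
    """Return the vector y + h*k, component by component."""
    return [yi + h * ki for yi, ki in zip(y, k)]

def rk4_step(f, x, y, h):
    """One step of the classical fourth-order Runge-Kutta method for y' = f(x, y)."""
    k1 = f(x, y)                                   # the whole vector k1 first ...
    k2 = f(x + h / 2, add_scaled(y, h / 2, k1))    # ... then the whole vector k2, etc.
    k3 = f(x + h / 2, add_scaled(y, h / 2, k2))
    k4 = f(x + h, add_scaled(y, h, k3))
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def f(x, y):
    # Illustrative system: y1' = y2, y2' = -y1
    return [y[1], -y[0]]

x, y, h = 0.0, [1.0, 0.0], 0.01
for _ in range(100):                               # integrate from x = 0 to x = 1
    y = rk4_step(f, x, y, h)
    x += h
print(y[0], math.cos(1.0))
```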

The most important difference which arises in dealing with a system of
differential equations is in the practical use of an implicit multistep
method. As we have seen, this almost always requires an iterative
method for the solution of an equation to determine $y_{n+1}$. Applying
such a method to a system of differential equations now involves the
solution of a system of equations, which will usually be nonlinear, to
determine the elements of the vector $\mathbf{y}_{n+1}$. In real-life
problems it is quite common to deal with systems of several hundred
differential equations, and it then becomes very important to be sure
that the improved efficiency of the implicit method justifies the very
considerable extra work in each step of the process.

We shall not discuss the extension of our earlier analysis to deal with
systems of differential equations; in almost all cases we simply need to
introduce vector notation, and replace the absolute value of a number
by the norm of a vector. For example, in the proof of Theorem 12.2,
(12.17) becomes

\[ \|\mathbf{e}_{n+1}\| \leq \|\mathbf{e}_n\| + h L_\Phi \|\mathbf{e}_n\| + h \|\mathbf{T}_n\| , \qquad n = 0, 1, \ldots, N - 1 , \]

where $\| \cdot \|$ is any norm on $\mathbb{R}^m$, with obvious
definitions of the global error $\mathbf{e}_n$ and the truncation error
$\mathbf{T}_n$. Similarly, Picard’s Theorem and its proof, discussed at
the beginning of the chapter in the case of a single ordinary
differential equation, can be easily extended to an $m$-component system
of differential equations by replacing the absolute value sign with a
vector norm on $\mathbb{R}^m$ throughout.


12.11 Stiff systems 


The phenomenon of stiffness usually appears only in a system of
differential equations, but we begin by discussing an almost trivial
example of a single equation,

\[ y' = \lambda y , \qquad y(0) = y_0 , \]

where $\lambda$ is a constant. The solution of this equation is evidently
$y(x) = y_0 \exp(\lambda x)$. When $\lambda < 0$ the absolute value of the
solution is exponentially decreasing, so it is sensible to require that
the absolute value of our numerical solution also decreases. It is very
easy to give expressions for the result of a numerical solution using
Euler’s method and the implicit Euler method (12.36). They are,
respectively,

\[ y_n^{E} = (1 + h\lambda)^n y_0 , \qquad y_n^{I} = (1 - h\lambda)^{-n} y_0 . \]

When $\lambda < 0$ and $h > 0$, we have $(1 - h\lambda) > 1$; therefore,
the sequence $(|y_n^{I}|)$ decreases monotonically with increasing $n$.
On the other hand, for $\lambda < 0$ and $h > 0$,

\[ |1 + h\lambda| < 1 \quad \text{if, and only if,} \quad 0 < h|\lambda| < 2 . \]

This gives the restriction $h|\lambda| < 2$ on the size of $h$ for which
the sequence $(|y_n^{E}|)$ decreases monotonically; if $h$ exceeds
$2/|\lambda|$, the numerical solution obtained by Euler’s method will
oscillate with increasing magnitude with increasing $n$ and fixed
$h > 0$, instead of converging to zero as $n \to \infty$.
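The step-size restriction can be observed directly from the two closed
forms. The following sketch tabulates both sequences; the values
$\lambda = -100$, $h = 0.03$ (so that $h|\lambda| = 3 > 2$) and
$h = 0.015$ (so that $h|\lambda| = 1.5 < 2$) are illustrative choices,
not values taken from the text.

```python
def euler(lam, h, y0, n):
    """Euler's method for y' = lam*y: y_j = (1 + h*lam)^j * y0, j = 0..n."""
    return [(1 + h * lam) ** j * y0 for j in range(n + 1)]

def implicit_euler(lam, h, y0, n):
    """Implicit Euler method for y' = lam*y: y_j = (1 - h*lam)^(-j) * y0, j = 0..n."""
    return [(1 - h * lam) ** (-j) * y0 for j in range(n + 1)]

lam = -100.0
unstable = euler(lam, 0.03, 1.0, 10)            # h|lam| = 3: oscillates and grows
stable = euler(lam, 0.015, 1.0, 10)             # h|lam| = 1.5: decays
implicit = implicit_euler(lam, 0.03, 1.0, 10)   # decays for every h > 0

for j in range(0, 11, 2):
    print(j, unstable[j], stable[j], implicit[j])
```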

We now consider the same two methods applied to the initial value
problem for a system of differential equations of the form

\[ \mathbf{y}' = A\mathbf{y} , \qquad \mathbf{y}(0) = \mathbf{y}_0 , \]

where $A$ is a square matrix of order $m$, each of whose elements is a
constant. For simplicity we assume that the eigenvalues of $A$ are
distinct, so there exists a matrix $M$ such that $MAM^{-1} = \Lambda$ is
a diagonal matrix. The system of differential equations is therefore
equivalent to

\[ \mathbf{z}' = \Lambda\mathbf{z} , \qquad \mathbf{z}(0) = \mathbf{z}_0 = M\mathbf{y}_0 , \]

with $\mathbf{z} = M\mathbf{y}$. In this form the system reduces to a set
of $m$ independent equations, whose solutions are

\[ z_j = z_j(0) \exp(\lambda_j x) , \qquad j = 1, 2, \ldots, m , \]

where the numbers $\lambda_j$, $j = 1, 2, \ldots, m$, are the diagonal
elements of the matrix $\Lambda$, and are therefore the eigenvalues of
$A$. In particular, if all the $\lambda_j$, $j = 1, 2, \ldots, m$, are
real and negative, then $\lim_{x \to +\infty} \|\mathbf{z}(x)\| = 0$ and
since

\[ \|\mathbf{y}(x)\| = \|M^{-1}\mathbf{z}(x)\| \leq \|M^{-1}\| \, \|\mathbf{z}(x)\| , \]

also

\[ \lim_{x \to +\infty} \|\mathbf{y}(x)\| = 0 . \]

Here $\| \cdot \|$ is any norm on $\mathbb{R}^m$, and the norm of
$M^{-1}$ is the associated subordinate matrix norm defined in Chapter 2.
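In just the same way, Euler’s method applied to the system gives

\[ \mathbf{y}_{n+1} = (I + hA)\mathbf{y}_n , \]

which leads to

\[ \mathbf{z}_{n+1} = M\mathbf{y}_{n+1} = M(I + hA)\mathbf{y}_n = M(I + hA)M^{-1}\mathbf{z}_n = (I + h\Lambda)\mathbf{z}_n . \]

Thus, the result $\mathbf{y}_{n+1}$ of Euler’s method applied to the
initial value problem $\mathbf{y}' = A\mathbf{y}$,
$\mathbf{y}(0) = \mathbf{y}_0$, is exactly the same as
$M^{-1}\mathbf{z}_{n+1}$, where $\mathbf{z}_{n+1}$ is the result of
applying Euler’s method to the transformed problem
$\mathbf{z}' = \Lambda\mathbf{z}$, $\mathbf{z}(0) = M\mathbf{y}_0$; an
analogous remark applies to the use of the implicit Euler method.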

Suppose that all the eigenvalues $\lambda_j$, $j = 1, 2, \ldots, m$, are
real and negative. Then, in order to ensure that, for a fixed positive
value of $h$,

\[ \lim_{n \to \infty} \|\mathbf{y}_n\| = 0 , \]

we must require that, for Euler’s method, $h|\lambda_j| < 2$,
$j = 1, 2, \ldots, m$; for the implicit Euler method no such condition is
required. The importance of this fact is highlighted by a numerical
example.
We consider the system where $A$ is the $2 \times 2$ matrix

\[ A = \begin{pmatrix} -8003 & 1999 \\ 23988 & -6004 \end{pmatrix} , \]

and the initial condition is

\[ \mathbf{y}(0) = \begin{pmatrix} 1 \\ 4 \end{pmatrix} . \]

The eigenvalues of $A$ are $\lambda_1 = -7$ and $\lambda_2 = -14000$; the
solution of the problem is

\[ \mathbf{y}(x) = \begin{pmatrix} \mathrm{e}^{-7x} \\ 4\mathrm{e}^{-7x} \end{pmatrix} . \]

Clearly, $\lim_{x \to +\infty} \|\mathbf{y}(x)\| = 0$.

The numerical solution uses 12 steps of size $h = 0.004$; the results
are shown in Table 12.1. The second column gives the first component
of the solution, $y_1(x) = \mathrm{e}^{-7x}$, the third column shows the
result from the implicit Euler method, and the last gives the result of
the standard Euler method. The last column is a dramatic example of what
happens when the step size $h$ is too large; in this case
$h|\lambda_2| = 56$. The numerical values given by the implicit Euler
method have an error of a few units in the third decimal digit; to get
the same accuracy from the Euler method would require a step size about
30 times smaller, and about 30 times as much work.
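An experiment of this kind is easy to set up. The sketch below applies
both methods to the $2 \times 2$ system with 12 steps of size
$h = 0.004$, using only the standard library and solving the implicit
Euler equations by Cramer's rule. The starting vector $(1, 4)^{\rm T}$ is
an assumption of this sketch, chosen because it is the eigenvector of
$A$ belonging to the eigenvalue $-7$ and therefore matches the solution
$\mathrm{e}^{-7x}$ tabulated in the second column of Table 12.1. With
such data the growth in the Euler column is triggered by rounding
errors, which are then multiplied by $1 + h\lambda_2 = -55$ at each
step, so the Euler values printed here depend on the floating-point
arithmetic used and need not reproduce the tabulated ones.

```python
import math

A = [[-8003.0, 1999.0], [23988.0, -6004.0]]
h, steps = 0.004, 12

def euler_step(y):
    """y_{n+1} = (I + hA) y_n."""
    return [y[0] + h * (A[0][0] * y[0] + A[0][1] * y[1]),
            y[1] + h * (A[1][0] * y[0] + A[1][1] * y[1])]

def implicit_euler_step(y):
    """Solve (I - hA) y_{n+1} = y_n by Cramer's rule (2 x 2 system)."""
    m11, m12 = 1 - h * A[0][0], -h * A[0][1]
    m21, m22 = -h * A[1][0], 1 - h * A[1][1]
    det = m11 * m22 - m12 * m21
    return [(m22 * y[0] - m12 * y[1]) / det,
            (m11 * y[1] - m21 * y[0]) / det]

y_euler = [1.0, 4.0]      # assumed starting vector (see the text above)
y_implicit = [1.0, 4.0]
for n in range(1, steps + 1):
    y_euler = euler_step(y_euler)
    y_implicit = implicit_euler_step(y_implicit)
    x = n * h
    print(f"{x:.3f}  {math.exp(-7 * x):.3f}  {y_implicit[0]:.3f}  {y_euler[0]:.3f}")
```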

It is clear that the difficulty in the numerical example is caused by the
size of the eigenvalue $-14000$, but what is important is its size
relative to the other eigenvalue. The special constant-coefficient system
$\mathbf{y}' = A\mathbf{y}$ is said to be stiff if all the eigenvalues of
$A$ have negative real parts, and if the ratio of the largest of the real
parts to the smallest of the real parts is large. Most practical problems
are nonlinear, and for such problems it is quite difficult to define
precisely what is meant by stiffness.¹ To begin with we may replace the
system by a linearised approximation, the first terms of an expansion

\[ \mathbf{y}'(x) = \mathbf{f}(x_n, \mathbf{y}(x_n)) + \frac{\partial \mathbf{f}}{\partial x}(x_n, \mathbf{y}(x_n))\,(x - x_n) + J(x_n)\,(\mathbf{y}(x) - \mathbf{y}(x_n)) + \cdots , \]

where $J$ is the Jacobian matrix of the function $\mathbf{f}$, whose
$(i, j)$-entry is

\[ J_{ij}(x_n) = \frac{\partial f_i}{\partial y_j}(x_n, \mathbf{y}(x_n)) . \]


Table 12.1. The use of Euler’s method and the implicit Euler method
to solve a stiff system.

    x        y_1(x)    Implicit Euler    Euler
    0.000    1.000     1.000                  1.000
    0.004    0.972     0.973                  0.972
    0.008    0.946     0.946                  0.945
    0.012    0.919     0.920                  0.918
    0.016    0.894     0.895                  0.893
    0.020    0.869     0.871                  0.868
    0.024    0.845     0.847                  0.843
    0.028    0.822     0.824                  0.820
    0.032    0.799     0.802                  0.794
    0.036    0.777     0.780                  0.941
    0.040    0.756     0.759                 −8.430
    0.044    0.735     0.738                505.769
    0.048    0.715     0.718             −27776.357


¹ Indeed, even in the case of variable-coefficient linear systems of
differential equations, stiffness can be defined in several
(nonequivalent) ways; for a discussion of the pros and cons of the
various definitions, we refer to Section 6.2 of J.D. Lambert, Numerical
Methods for Ordinary Differential Systems, Wiley, Chichester, 1991.


We can then think of the system as being stiff if the eigenvalues of the
matrix $J(x_n)$ have negative real parts and if the ratio of the largest
of the real parts to the smallest is large. Although this gives some
indication of the sort of problems which may cause difficulty, the
behaviour of nonlinear systems is much more complicated than this. It is
not difficult to construct examples in which all the eigenvalues of the
Jacobian matrix have negative real parts, yet the norm of the solution of
the differential equation is exponentially increasing as
$x \to +\infty$.

Even though any classification of nonlinear systems of differential
equations into stiff and nonstiff, based only on monitoring the
eigenvalues of $J(x_n)$, is somewhat simplistic, it does highlight some
of the key difficulties. Stiff systems of differential equations arise in
many application areas, a typical one being chemical engineering. For
example, in parts of an oil refinery there may be a large number of
substances undergoing chemical reactions with widely different reaction
rates. These reaction rates correspond to the eigenvalues of the Jacobian
matrix, and it is not unusual to find the ratio of the largest of the
real parts to the smallest to be in excess of $10^{10}$. For such
problems it is essential to find a numerical method which imposes no
restriction on the step size; Euler’s method, which might require the
restriction $10^{10} h < 2$, would evidently be quite useless.
Application of the linear multistep method

\[ \sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f(x_{n+j}, y_{n+j}) \]

to the equation $y' = \lambda y$ leads to the $k$th-order linear
recurrence relation

\[ \sum_{j=0}^{k} (\alpha_j - \lambda h \beta_j)\, y_{n+j} = 0 . \tag{12.48} \]

The characteristic polynomial of the linear recurrence relation (12.48) is

\[ \pi(z; \lambda h) = \sum_{j=0}^{k} (\alpha_j - \lambda h \beta_j)\, z^j . \]

Alternatively, we can write this in terms of the first and second
characteristic polynomials of the linear multistep method as

\[ \pi(z; \lambda h) = \rho(z) - \lambda h \sigma(z) . \]


In the present context, the polynomial $\pi(\,\cdot\,; \lambda h)$ is
usually referred to as the stability polynomial of the linear multistep
method. According to Lemma 12.1, the general solution of the recurrence
relation (12.48) can be expressed in terms of the distinct roots $z_r$,
$1 \leq r \leq \ell$, $\ell \leq k$, of $\pi(\,\cdot\,; \lambda h)$.
Letting $m_r$ denote the multiplicity of the root $z_r$,
$1 \leq r \leq \ell$, $m_1 + \cdots + m_\ell = k$, we have that

\[ y_n = \sum_{r=1}^{\ell} p_r(n)\, z_r^n , \tag{12.49} \]

where the polynomial $p_r(\,\cdot\,)$ has degree $m_r - 1$,
$1 \leq r \leq \ell$.
Clearly, the roots $z_r$ are functions of $\lambda h$. For
$\lambda \in \mathbb{C}$, with $\mathrm{Re}(\lambda) < 0$, the solution
of the model problem

\[ y' = \lambda y , \qquad y(0) = y_0 , \]

converges in $\mathbb{C}$ to 0 as $x \to \infty$. Thus, we would like to
ensure that, when a linear multistep method is applied to this problem,
the step size $h$ can be chosen so that the resulting sequence of
numerical approximations $(y_n)$ exhibits an analogous behaviour as
$n \to \infty$, that is, $\lim_{n \to \infty} y_n = 0$. By virtue of
(12.49), this can be guaranteed by demanding that each root
$z_r = z_r(\lambda h)$ has modulus less than 1.
Definition 12.6 A linear multistep method is said to be absolutely
stable for a given value of $\lambda h$ if each root
$z_r = z_r(\lambda h)$ of the associated stability polynomial
$\pi(\,\cdot\,; \lambda h)$ satisfies $|z_r(\lambda h)| < 1$.


Our aim is, therefore, to single out those values of $\lambda h$ for
which the linear multistep method is absolutely stable.


Definition 12.7 The region of absolute stability of a linear multistep
method is the set of all points $\lambda h$ in the complex plane for
which the method is absolutely stable.


Ideally, the region of absolute stability of the method should admit all
values of $\lambda$, $\mathrm{Re}(\lambda) < 0$, so as to ensure that
there is no limitation on the size of $h$, however large $|\lambda|$ may
be. This leads us to the next definition.


Definition 12.8 A linear multistep method is said to be A-stable if
its region of absolute stability contains the negative (left) complex
half-plane.
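Whether a given point of the complex plane belongs to the region of
absolute stability can be tested by computing the roots of the stability
polynomial. The following is a minimal sketch, assuming NumPy is
available; the function names `stability_roots` and
`is_absolutely_stable` are hypothetical. The sketch does not treat the
degenerate case in which the leading coefficient of the stability
polynomial vanishes.

```python
import numpy as np

def stability_roots(alpha, beta, lam_h):
    """Roots of rho(z) - lam_h * sigma(z).

    alpha[j], beta[j] hold the coefficients alpha_j, beta_j for j = 0..k.
    """
    coeffs = [a - lam_h * b for a, b in zip(alpha, beta)]   # coefficient of z^j
    return np.roots(coeffs[::-1])                            # numpy expects highest power first

def is_absolutely_stable(alpha, beta, lam_h):
    """True if every root of the stability polynomial has modulus below 1."""
    return bool(np.all(np.abs(stability_roots(alpha, beta, lam_h)) < 1))

euler = ([-1, 1], [1, 0])            # y_{n+1} - y_n = h f_n
trapezium = ([-1, 1], [0.5, 0.5])    # y_{n+1} - y_n = (h/2)(f_{n+1} + f_n)

print(is_absolutely_stable(*euler, -1.0))
print(is_absolutely_stable(*euler, -3.0))
print(is_absolutely_stable(*trapezium, -1000 + 5j))
```

Sampling such a test over a grid of complex values gives a rough picture
of the region of absolute stability of a method.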


Unfortunately, the condition of A-stability is extremely demanding.
Dahlquist¹ has shown the following results which are collectively known
as his Second Barrier Theorem:


(i) No explicit linear multistep method is A-stable;
(ii) No A-stable linear multistep method can have order greater than 2;
(iii) The second-order A-stable linear multistep method with the
smallest error constant is the trapezium rule method.


The trapezium rule method is a one-step method, so the associated
stability polynomial has only one root, given by

\[ z = \frac{1 + \frac{1}{2}\lambda h}{1 - \frac{1}{2}\lambda h} . \]

Evidently $|z| < 1$ if $\mathrm{Re}(h\lambda) = h\,\mathrm{Re}(\lambda) < 0$,
so the trapezium rule method is indeed A-stable.
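The step from the formula for $z$ to this conclusion can be written out
in one line. With the shorthand $w = \lambda h$ (introduced only for this
remark), expanding the squared moduli of numerator and denominator gives

```latex
\[
\left| 1 + \tfrac{1}{2} w \right|^2 - \left| 1 - \tfrac{1}{2} w \right|^2
  = \left( 1 + \mathrm{Re}(w) + \tfrac{1}{4} |w|^2 \right)
  - \left( 1 - \mathrm{Re}(w) + \tfrac{1}{4} |w|^2 \right)
  = 2\,\mathrm{Re}(w) .
\]
```

Hence the numerator has smaller modulus than the denominator, that is
$|z| < 1$, precisely when $\mathrm{Re}(\lambda h) < 0$; for such values
the denominator cannot vanish, since its real part exceeds 1.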

To construct useful methods of higher order we need to relax the 
condition of A-stability by requiring that the region of absolute stability 
should include a large part of the negative half-plane, and certainly that 
it contains the whole of the negative real axis. 


¹ G. Dahlquist, A special stability problem for linear multistep methods,
BIT 3, 27-43, 1963.
The most efficient methods of this kind in current use are the Backward
Differentiation Formulae, or BDF methods. These are the linear multistep
methods (12.35) in which $\beta_j = 0$, $0 \leq j \leq k - 1$,
$k \geq 1$, and $\beta_k \neq 0$. Thus,

\[ \alpha_k y_{n+k} + \cdots + \alpha_0 y_n = h \beta_k f_{n+k} . \]

The coefficients are obtained by requiring that the order of accuracy of
the method is as high as possible, i.e., by making the coefficients $C_j$
zero in (12.47) for $j = 0, 1, \ldots, k$. For $k = 1$ this yields the
implicit Euler method (BDF1), whose order of accuracy is, of course, 1;
the method is A-stable. The choice of $k = 6$ results in the sixth-order,
six-step BDF method (BDF6):

\[ 147 y_{n+6} - 360 y_{n+5} + 450 y_{n+4} - 400 y_{n+3} + 225 y_{n+2} - 72 y_{n+1} + 10 y_n = 60 h f_{n+6} . \tag{12.50} \]

Although the method (12.50) is not A-stable, its region of absolute
stability includes the whole of the negative real axis (see Figure 12.5).
For the intermediate values, $k = 2, 3, 4, 5$, we have the following
$k$th-order, $k$-step BDF methods, respectively:

\[
\begin{aligned}
3 y_{n+2} - 4 y_{n+1} + y_n &= 2 h f_{n+2} , \\
11 y_{n+3} - 18 y_{n+2} + 9 y_{n+1} - 2 y_n &= 6 h f_{n+3} , \\
25 y_{n+4} - 48 y_{n+3} + 36 y_{n+2} - 16 y_{n+1} + 3 y_n &= 12 h f_{n+4} , \\
137 y_{n+5} - 300 y_{n+4} + 300 y_{n+3} - 200 y_{n+2} + 75 y_{n+1} - 12 y_n &= 60 h f_{n+5} ,
\end{aligned}
\]

referred to as BDF2, BDF3, BDF4 and BDF5. Their regions of absolute
stability are also shown in Figure 12.5. In each case the region of
absolute stability includes the negative real axis. Higher-order methods
of this type cannot be used, as all BDF methods, with $k > 6$, are
zero-unstable.
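The BDF coefficients, and the loss of zero-stability beyond six steps,
can be checked by machine. The sketch below does not solve the order
conditions directly; instead it relies on the standard
backward-difference representation of the $k$-step BDF method,
$\sum_{j=1}^{k} \frac{1}{j} \nabla^j y_{n+k} = h f_{n+k}$, which is a
known fact assumed here rather than something derived in the text. This
corresponds to the first characteristic polynomial
$\sum_{j=1}^{k} \frac{1}{j} z^{k-j} (z - 1)^j$. The function names are
hypothetical, NumPy is assumed for the root computation, and `math.lcm`
requires Python 3.9 or later.

```python
from fractions import Fraction
from math import comb, lcm
import numpy as np

def bdf_alpha(k):
    """alpha_0..alpha_k of the k-step BDF method, normalised so that beta_k = 1."""
    alpha = [Fraction(0)] * (k + 1)
    for j in range(1, k + 1):
        # Expand (1/j) z^(k-j) (z - 1)^j by the binomial theorem.
        for i in range(j + 1):
            alpha[k - j + i] += Fraction(comb(j, i) * (-1) ** (j - i), j)
    return alpha

def bdf_integer_form(k):
    """Integer coefficients alpha_0..alpha_k and the matching multiple of h."""
    alpha = bdf_alpha(k)
    scale = lcm(*(a.denominator for a in alpha))
    return [int(a * scale) for a in alpha], scale

def largest_spurious_root(k):
    """Largest modulus among the roots of rho other than the principal root z = 1."""
    roots = np.roots([float(a) for a in reversed(bdf_alpha(k))])
    principal = np.argmin(np.abs(roots - 1))
    others = np.delete(roots, principal)
    return float(np.max(np.abs(others))) if others.size else 0.0

print(bdf_integer_form(6))    # compare with (12.50), coefficients listed from y_n upwards
for k in range(1, 8):
    print(k, largest_spurious_root(k))
```

A largest spurious modulus below 1 indicates that the Root Condition
holds, since $z = 1$ is then the only root on the unit circle.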


12.12 Implicit Runge-Kutta methods 


For Runge-Kutta methods absolute stability is defined in much the same
way as for linear multistep methods; i.e., by applying the method in
question to the model problem $y' = \lambda y$, $y(0) = y_0$,
$\lambda \in \mathbb{C}$, $\mathrm{Re}(\lambda) < 0$, and demanding that
the resulting sequence $(y_n)$ converges to 0 as $n \to \infty$, with
$h\lambda$ held fixed. The set of all values of $h\lambda$ in the complex
plane for which the method is absolutely stable is called the region of
absolute stability of the Runge-Kutta method.

Classical Runge-Kutta methods are explicit, and are unsuitable for
stiff systems because of their small region of absolute stability. Figure
12.6 depicts the region of absolute stability of the classical
fourth-order Runge-Kutta method, together with that of the fourth-order
Backward Differentiation Formula, BDF4. The contrast is striking: while
the region of absolute stability of BDF4 includes most of the negative
half-plane and, in particular, all of the negative real axis, for RK4
the region of absolute stability is bounded¹ (for example, along the
negative real axis it does not extend to the left of, approximately,
$-2.8$).

¹ This is not a peculiarity of RK4. It can be shown that every explicit
Runge-Kutta method has bounded region of absolute stability; see, for
example, Section 5.12, in J.D. Lambert’s book, cited in the previous
section.


Fig. 12.5. Absolute stability regions in the complex plane for $k$-step
Backward Differentiation Formulae, $k = 1, 2, \ldots, 6$. In each case
the region of absolute stability is the set of points in the complex
plane outside the white region. In each case, the region of absolute
stability contains the whole of the negative real axis.


Fig. 12.6. The dark chequered region in the figure on the left indicates
part of the absolute stability region in the complex plane for the
four-step, fourth-order Backward Differentiation Formula, BDF4 (zoom into
Figure 12.5(d)); here we only show the section of the region of absolute
stability for BDF4 which lies in the rectangle
$-8 \leq \mathrm{Re}(\lambda h) \leq 0$ and
$-4 \leq \mathrm{Im}(\lambda h) \leq 4$, with $\mathrm{Re}(\lambda) < 0$,
$h > 0$. The dark region in the figure on the right shows the region of
absolute stability for the classical explicit fourth-order Runge-Kutta
method, RK4. For BDF4, the region of absolute stability includes the
whole of the negative real axis; clearly, this is not the case for RK4.


Motivated by the fact that BDF methods are implicit, we now go on
to introduce implicit Runge-Kutta methods, which can also have a large
region of absolute stability.

The general $s$-stage Runge-Kutta method is written

\[ y_{n+1} = y_n + h \sum_{i=1}^{s} b_i k_i , \]

where

\[ k_i = f\Big( x_n + h c_i ,\; y_n + h \sum_{j=1}^{s} a_{ij} k_j \Big) , \qquad i = 1, \ldots, s . \tag{12.51} \]

It is convenient to display the coefficients in a Butcher tableau

\[
\begin{array}{c|ccc}
c_1 & a_{11} & \cdots & a_{1s} \\
\vdots & \vdots & & \vdots \\
c_s & a_{s1} & \cdots & a_{ss} \\
\hline
 & b_1 & \cdots & b_s
\end{array}
\]

The method is then defined by the matrix
$A = (a_{ij}) \in \mathbb{R}^{s \times s}$, of order $s$, and the two
vectors $b = (b_1, \ldots, b_s)^{\rm T} \in \mathbb{R}^s$ and
$c = (c_1, \ldots, c_s)^{\rm T} \in \mathbb{R}^s$.
For example, the classical four-stage Runge-Kutta method is defined by
the tableau

\[
\begin{array}{c|cccc}
0 & & & & \\
\frac{1}{2} & \frac{1}{2} & & & \\
\frac{1}{2} & 0 & \frac{1}{2} & & \\
1 & 0 & 0 & 1 & \\
\hline
 & \frac{1}{6} & \frac{1}{3} & \frac{1}{3} & \frac{1}{6}
\end{array}
\]

The $4 \times 4$ array representing the matrix $A$ for this method,
displayed in the upper right quadrant of the tableau, follows the usual
notational convention that zero elements after the last nonzero element
in each row of the matrix $A$ are omitted.

This is an explicit method, shown by the fact that the matrix $A$ is
strictly lower triangular, with $a_{ij} = 0$ when
$1 \leq i \leq j \leq 4$. Each value $k_i$ can therefore be calculated in
sequence, all the quantities on the right-hand side of (12.51) being
known.
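The sequential evaluation of the stages translates directly into a loop
over the rows of the tableau. The sketch below performs one step of a
general explicit method for a scalar equation; the function name
`explicit_rk_step` and the test equation $y' = y$ are assumptions made
for illustration.

```python
def explicit_rk_step(f, x, y, h, A, b, c):
    """One step of an explicit s-stage Runge-Kutta method for a scalar equation.

    A is strictly lower triangular, so stage i uses only k_1, ..., k_{i-1};
    entries A[i][j] with j >= i are never read.
    """
    k = []
    for i in range(len(b)):
        y_stage = y + h * sum(A[i][j] * k[j] for j in range(i))
        k.append(f(x + c[i] * h, y_stage))
    return y + h * sum(bi * ki for bi, ki in zip(b, k))

# Tableau of the classical four-stage method
A = [[0, 0, 0, 0],
     [0.5, 0, 0, 0],
     [0, 0.5, 0, 0],
     [0, 0, 1, 0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
c = [0, 0.5, 0.5, 1]

h = 0.1
y1 = explicit_rk_step(lambda x, y: y, 0.0, 1.0, h, A, b, c)
# For y' = y one step reproduces the Taylor polynomial of e^h up to h^4.
print(y1, 1 + h + h**2 / 2 + h**3 / 6 + h**4 / 24)
```

For an implicit method the same loop would not work, because stage $i$
would then depend on stages that have not yet been computed.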

It is not difficult to construct $s$-stage implicit methods which are
A-stable. For example, this can be done by choosing the coefficients
$c_i$ and $b_i$ to be the quadrature points and weights respectively in
the Gauss quadrature formula for the evaluation of

\[ \int_0^1 g(x)\, \mathrm{d}x \approx \sum_{i=1}^{s} b_i g(c_i) . \]

The numbers $a_{ij}$ can then be chosen so that the method has order
$2s$, and is A-stable.
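For example, the array

\[
\begin{array}{c|cc}
\frac{1}{6}(3 - \sqrt{3}) & \frac{1}{4} & \frac{1}{12}(3 - 2\sqrt{3}) \\
\frac{1}{6}(3 + \sqrt{3}) & \frac{1}{12}(3 + 2\sqrt{3}) & \frac{1}{4} \\
\hline
 & \frac{1}{2} & \frac{1}{2}
\end{array}
\]

defines a 2-stage A-stable method of order 4.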
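The A-stability of this two-stage method can be probed numerically.
Applying an $s$-stage Runge-Kutta method to $y' = \lambda y$ gives
$y_{n+1} = R(h\lambda)\, y_n$ with
$R(z) = 1 + z\, b^{\rm T} (I - zA)^{-1} e$, where $e = (1, \ldots, 1)^{\rm T}$;
this is a standard identity that is assumed here, not derived in the
text. The sketch below evaluates $R$ for the tableau above by solving the
$2 \times 2$ system with Cramer's rule, and compares it with the rational
function $(1 + z/2 + z^2/12)/(1 - z/2 + z^2/12)$, which one obtains by
carrying out the same elimination by hand. The names `R` and `pade22` are
hypothetical.

```python
import math

s3 = math.sqrt(3.0)
A = [[1 / 4, (3 - 2 * s3) / 12],
     [(3 + 2 * s3) / 12, 1 / 4]]
b = [1 / 2, 1 / 2]

def R(z):
    """Stability function 1 + z b^T (I - zA)^{-1} e for the 2-stage tableau."""
    m11, m12 = 1 - z * A[0][0], -z * A[0][1]
    m21, m22 = -z * A[1][0], 1 - z * A[1][1]
    det = m11 * m22 - m12 * m21
    u1 = (m22 - m12) / det        # (u1, u2) solves (I - zA) u = (1, 1)^T
    u2 = (m11 - m21) / det
    return 1 + z * (b[0] * u1 + b[1] * u2)

def pade22(z):
    """Closed form of R obtained by hand for this tableau."""
    return (1 + z / 2 + z**2 / 12) / (1 - z / 2 + z**2 / 12)

samples = [-1.0, -10 + 3j, -0.1 + 100j, -1000.0]
for z in samples:
    print(z, abs(R(z)), abs(R(z) - pade22(z)))
```

Sampling only illustrates the property at the chosen points; it is not a
proof that $|R(z)| < 1$ throughout the left half-plane.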

However, there is a heavy price to pay for using implicit methods of
this kind, as we now have to calculate all the numbers $k_i$,
$i = 1, 2, \ldots, s$, simultaneously, not in succession. For a system of
$m$ differential equations an implicit linear multistep method requires
the solution of $m$ simultaneous equations at each step; an $s$-stage
implicit Runge-Kutta method requires the solution of $sm$ simultaneous
equations. This is a considerable increase in cost, and the general
implicit Runge-Kutta methods cannot compete in efficiency with the
Backward Differentiation Formulae such as (12.50); their use is almost
exclusively limited to stiff systems of ODEs.

The overall computational effort can be somewhat reduced by using
diagonally implicit Runge-Kutta (or DIRK) methods, in which the
matrix $A$ is lower triangular, so that $a_{ij} = 0$ if $j > i$. A
further improvement in efficiency is possible by requiring in addition
that all the diagonal elements $a_{ii}$ are the same; unfortunately it
has proved difficult to construct such methods with order greater than 4.


12.13 Notes 


In this chapter we have only been able to introduce some of the basic 
ideas in what has become a vast area of numerical analysis. In particular 
we have not discussed the practical implementation of the various meth- 
ods. The questions of how to choose the step size h to obtain efficiently 
a prescribed accuracy, and when and how to adjust h during the course 
of the calculation, are dealt with in the following books. 


» E. HAIRER, S.P. NØRSETT, AND G. WANNER, Solving Ordinary
Differential Equations I: Nonstiff Problems, Second Edition, Springer 
Series in Computational Mathematics, 8, Springer, Berlin, 1993. 

» A. ISERLES, A First Course in the Numerical Analysis of Differential 
Equations, Cambridge University Press, Cambridge, 1996. 

» J.D. LAMBERT, Numerical Methods for Ordinary Differential Sys- 
tems, John Wiley & Sons, Chichester, 1991. 

For a study of dynamical systems and their numerical analysis, with
focus on long-time behaviour, we refer to 


» A.M. STUART AND A.R. HUMPHRIES, Dynamical Systems and Nu- 
merical Analysis, Cambridge University Press, Cambridge, 1999. 


The numerical solution of stiff initial value problems for systems of or- 
dinary differential equations is discussed in 


» E. HAIRER AND G. WANNER, Solving Ordinary Differential Equa- 
tions II: Stiff and Differential-Algebraic Problems, Springer Series in 
Computational Mathematics, 14, Springer, Berlin, 1991. 


An extensive survey of the theory of Runge-Kutta and linear multistep 
methods is found in 


» J.C. BUTCHER, The Numerical Analysis of Ordinary Differential
Equations. Runge-Kutta and General Linear Methods, Wiley-Inter- 
science, John Wiley & Sons, Chichester, 1987. 


Satisfactory theoretical treatment of nonlinear systems of differential 
equations from the point of view of stiffness requires the development of a 
genuinely nonlinear stability theory which does not involve the rather du- 
bious idea of defining stiffness through linearisation based on the ‘frozen 
Jacobian matrix’. We close by mentioning just one concept in this di- 
rection — that of algebraic stability. Given a Runge-Kutta method with 


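Butcher tableau

\[
\begin{array}{c|c}
c & A \\ \hline
  & b^{\mathrm T}
\end{array}
\]

we define the matrices

\[ B = \mathrm{diag}(b_1, b_2, \ldots, b_s) \quad\text{and}\quad M = BA + A^{\mathrm T}B - bb^{\mathrm T} . \]

The method is said to be algebraically stable if the matrices B and
M are both positive semidefinite, i.e., $x^{\mathrm T} B x \ge 0$ and $x^{\mathrm T} M x \ge 0$ for all
$x \in \mathbb{R}^s$. Algebraic stability can be seen to ensure that approximations to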
solutions of nonlinear systems of differential equations exhibit acceptable 
numerical behaviour. For example, the Gauss-Runge-Kutta methods 
discussed in the last section are algebraically stable. For further details, 
see, for example, 


» K. DEKKER AND J.G. VERWER, Stability of Runge-Kutta Methods for
Stiff Nonlinear Differential Equations, North-Holland, Amsterdam, 
1984. 
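
The definition of algebraic stability is easy to test numerically for a given tableau. The following is a minimal sketch, assuming NumPy; the function name `is_algebraically_stable` and the tolerance are illustrative choices, not part of the text. It forms $B$ and $M = BA + A^{\mathrm T}B - bb^{\mathrm T}$ and checks that both are positive semidefinite. For the 2-stage Gauss method of Section 12.12 the matrix $M$ turns out to be the zero matrix.

```python
import numpy as np


def is_algebraically_stable(A, b, tol=1e-12):
    """Check that B = diag(b) and M = BA + A^T B - b b^T are positive semidefinite."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    B = np.diag(b)
    M = B @ A + A.T @ B - np.outer(b, b)
    # M is symmetric by construction, so eigvalsh is appropriate.
    return bool(np.all(b >= -tol) and np.min(np.linalg.eigvalsh(M)) >= -tol), M


s3 = np.sqrt(3.0)
A_gauss = [[1 / 4, (3 - 2 * s3) / 12],
           [(3 + 2 * s3) / 12, 1 / 4]]
b_gauss = [0.5, 0.5]

stable_gauss, M_gauss = is_algebraically_stable(A_gauss, b_gauss)
stable_explicit_euler, _ = is_algebraically_stable([[0.0]], [1.0])
stable_implicit_euler, _ = is_algebraically_stable([[1.0]], [1.0])
print(stable_gauss, stable_explicit_euler, stable_implicit_euler)
```

The explicit and implicit Euler methods are included only as one-stage comparison cases: for them $M$ is the scalar $-1$ and $+1$ respectively.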

Exercises

12.1 Verify that the following functions satisfy a Lipschitz condition
on the respective intervals and find the associated Lipschitz constants:

(a) $f(x,y) = 2yx^{-4}$, $x \in [1,\infty)$;
(b) $f(x,y) = e^{-x^2} \tan^{-1} y$, $x \in [1,\infty)$;
(c) $f(x,y) = 2y(1+y^2)^{-1}(1+e^{-|x|})$, $x \in (-\infty,\infty)$.


12.2 Suppose that m is a fixed positive integer. Show that the initial
value problem

\[ y' = y^{2m/(2m+1)}, \qquad y(0) = 0, \]

has infinitely many continuously differentiable solutions. Why
does this not contradict Picard's Theorem?

12.3 Write down the solution y of the initial value problem

\[ y' = py + q, \qquad y(0) = 1, \]

where p and q are constants. Suppose that the method in the
proof of Picard's Theorem is used to generate the sequence of
approximations $y_n(x)$, $n = 0,1,2,\ldots$; show that $y_n(x)$ is a
polynomial of degree n, and consists of the first n + 1 terms in the
series expansion of $y(x)$ in powers of $x$.

12.4 Show that Euler's method fails to approximate the solution
$y(x) = (4x/5)^{5/4}$ of the initial value problem $y' = y^{1/5}$, $y(0) = 0$.
Justify your answer.

Consider approximating the same problem with the implicit
Euler method. Show that there is a solution of the form
$y_n = (C_n h)^{5/4}$, $n \ge 0$, with $C_0 = 0$ and $C_1 = 1$ and $C_n > 1$ for all
$n \ge 2$.

12.5 Write down Euler's method for the solution of the problem

\[ y' = x e^{-5x} - 5y, \qquad y(0) = 0, \]

on the interval [0,1] with step size $h = 1/N$. Denoting by $y_N$
the resulting approximation to $y(1)$, show that $y_N \to y(1)$ as
$N \to \infty$.

12.6 Consider the initial value problem

\[ y' = \ln\ln(4 + y^2), \qquad x \in [0,1], \qquad y(0) = 1, \]

and the sequence $(y_n)_{n=0}^{N}$, $N \ge 1$, generated by the Euler
method

\[ y_{n+1} = y_n + h \ln\ln(4 + y_n^2), \qquad n = 0,1,\ldots,N-1, \qquad y_0 = 1, \]

using the mesh points $x_n = nh$, $n = 0,1,\ldots,N$, with spacing
$h = 1/N$.

(i) Let $T_n$ denote the truncation error of Euler's method for
this initial value problem at the point $x = x_n$. Show that
$|T_n| \le h/4$.

(ii) Verify that

\[ |y(x_{n+1}) - y_{n+1}| \le (1 + hL)\,|y(x_n) - y_n| + h|T_n| \]

for $n = 0,1,\ldots,N-1$, where $L = 1/(2\ln 4)$.

(iii) Find a positive integer $N_0$, as small as possible, such that

\[ \max_{0 \le n \le N} |y(x_n) - y_n| \le 10^{-4} \]

whenever $N \ge N_0$.

12.7 Define the truncation error $T_n$ of the trapezium rule method

\[ y_{n+1} = y_n + \tfrac{1}{2} h\,(f_{n+1} + f_n) \]

for the numerical solution of $y' = f(x,y)$ with $y(0) = y_0$ given,
where $f_n = f(x_n, y_n)$ and $h = x_{n+1} - x_n$.
By integrating by parts the integral

\[ \int_{x_n}^{x_{n+1}} (x - x_{n+1})(x - x_n)\, y'''(x)\,dx , \]

or otherwise, show that

\[ T_n = -\tfrac{1}{12} h^2 y'''(\xi_n) \]

for some $\xi_n$ in the interval $(x_n, x_{n+1})$, where y is the solution of
the initial value problem.
Suppose that f satisfies the Lipschitz condition

\[ |f(x,u) - f(x,v)| \le L\,|u - v| \]

for all real x, u, v, where L is a positive constant independent
of x, and that $|y'''(x)| \le M$ for some positive constant M
independent of x. Show that the global error $e_n = y(x_n) - y_n$
satisfies the inequality

\[ |e_{n+1}| \le |e_n| + \tfrac{1}{2} hL\,(|e_{n+1}| + |e_n|) + \tfrac{1}{12} h^3 M . \]

For a constant step size $h > 0$ satisfying $hL < 2$, deduce that,
if $y_0 = y(x_0)$, then

\[ |e_n| \le \frac{h^2 M}{12 L} \left[ \left( \frac{1 + \frac{1}{2} hL}{1 - \frac{1}{2} hL} \right)^{n} - 1 \right] . \]

12.8 Show that the one-step method defined by

\[ y_{n+1} = y_n + \tfrac{1}{2} h\,(k_1 + k_2) , \]

where

\[ k_1 = f(x_n, y_n), \qquad k_2 = f(x_n + h, y_n + h k_1) , \]

is consistent and has truncation error

\[ T_n = \tfrac{1}{6} h^2 \left[ f_y (f_x + f_y f) - \tfrac{1}{2} (f_{xx} + 2 f_{xy} f + f_{yy} f^2) \right] + O(h^3) . \]


12.9 When the classical fourth-order Runge-Kutta method is applied
to the differential equation $y' = \lambda y$, where $\lambda$ is a constant, show
that

\[ y_{n+1} = \left( 1 + h\lambda + \tfrac{1}{2} h^2\lambda^2 + \tfrac{1}{6} h^3\lambda^3 + \tfrac{1}{24} h^4\lambda^4 \right) y_n . \]

Compare this with the Taylor series expansion of $y(x_{n+1}) =
y(x_n + h)$ about the point $x = x_n$.

12.10 Consider the one-step method

\[ y_{n+1} = y_n + \alpha h f(x_n, y_n) + \beta h f(x_n + \gamma h,\; y_n + \gamma h f(x_n, y_n)) , \]

where $\alpha$, $\beta$ and $\gamma$ are real parameters and $h > 0$. Show that the
method is consistent if, and only if, $\alpha + \beta = 1$. Show also that
the order of the method cannot exceed 2.

Suppose that a second-order method of the above form is
applied to the initial value problem $y' = -\lambda y$, $y(0) = 1$, where
$\lambda$ is a positive real number. Show that the sequence $(y_n)_{n \ge 0}$ is
bounded if, and only if, $h \le 2/\lambda$. Show further that, for such $h$,

\[ |y(x_n) - y_n| \le \tfrac{1}{6} \lambda^3 h^2 x_n , \qquad n \ge 0 . \]

12.11 Find the values of $\alpha$ and $\beta$ so that the three-step method

\[ y_{n+3} + \alpha (y_{n+2} - y_{n+1}) - y_n = h\beta (f_{n+2} + f_{n+1}) \]

has order of accuracy 4, and show that the resulting method is
not zero-stable.

12.12 Consider approximating the initial value problem $y' = f(x,y)$,
$y(0) = y_0$ by the linear multistep method

\[ y_{n+1} + b\,y_{n-1} + a\,y_{n-2} = h f(x_n, y_n) \]

on the regular mesh $x_n = nh$ where a and b are constants.

(i) For a certain (unique) choice of a and b, this method is
consistent. Find these values of a and b and verify that the order
of accuracy is 1.

(ii) Although the method is consistent for the choice of a and
b from part (i), the numerical solution it generates will not, in
general, converge to the solution of the initial value problem
as $h \to 0$, because the method is not zero-stable. Show that
the method is not zero-stable for these a and b, and describe
quantitatively what the unstable solutions will look like for
small h.

12.13 Given that $\alpha$ is a positive real number, consider the linear
two-step method

\[ y_{n+2} - \alpha y_n = \frac{h}{3} \left[ f(x_{n+2}, y_{n+2}) + 4 f(x_{n+1}, y_{n+1}) + f(x_n, y_n) \right] , \]

on the mesh $\{x_n : x_n = x_0 + nh,\ n = 1,2,\ldots,N\}$ of spacing
h, $h > 0$. Determine the set of all $\alpha$ such that the method is
zero-stable. Find $\alpha$ such that the order of accuracy is as high
as possible; is the method convergent for this value of $\alpha$?

12.14 Which of the following linear multistep methods for the solution
of the initial value problem $y' = f(x,y)$, $y(0)$ given, are
zero-stable?

(a) $y_{n+1} - y_n = h f_n$,
(b) $y_{n+1} + y_n - 2y_{n-1} = h\,(f_{n+1} + f_n + f_{n-1})$,
(c) $y_{n+1} - y_{n-1} = \tfrac{1}{3} h\,(f_{n+1} + 4 f_n + f_{n-1})$,
(d) $y_{n+1} - y_n = \tfrac{1}{2} h\,(3 f_n - f_{n-1})$,
(e) $y_{n+1} - y_n = \tfrac{1}{12} h\,(5 f_{n+1} + 8 f_n - f_{n-1})$.

For the methods under (a) and (c) explore absolute stability
when applied to the differential equation $y' = \lambda y$ with $\lambda < 0$.


12.15 Determine the order of the linear multistep method

\[ y_{n+2} - (1 + a)\,y_{n+1} + a\,y_n = \tfrac{1}{4} h \left[ (3 - a) f_{n+2} + (1 - 3a) f_n \right] \]

and investigate its zero-stability and absolute stability.

12.16 Assuming that $\sigma(z) = z^2$ is the second characteristic
polynomial of a linear two-step method, find a quadratic polynomial
$\rho(z)$ such that the order of the method is 2. Is this method
convergent? By applying the method to $y' = \lambda y$, $y(0) = 1$, where
$\lambda$ is a negative real number, show that the method is absolutely
stable for all $h > 0$.

12.17 Consider the $\theta$-method

\[ y_{n+1} = y_n + h \left[ (1 - \theta) f_n + \theta f_{n+1} \right] \]

for $\theta \in [0,1]$. Show that the method is A-stable if, and only if,
$\theta \ge 1/2$.

12.18 Write down an expression for the Lagrange interpolation
polynomial of degree 2 for a function $x \mapsto y(x)$, using the
interpolation points $x_n$, $x_{n+1} = x_n + h$ and $x_{n+2} = x_n + 2h$, $h > 0$.
Differentiate this polynomial to show that

\[ y'(x_{n+2}) = \frac{1}{2h} \left( 3y(x_{n+2}) - 4y(x_{n+1}) + y(x_n) \right) + O(h^2) , \]

provided that $y \in C^3[x_n, x_{n+2}]$. Confirm this result by
determining the truncation error of the BDF2 method

\[ 3y_{n+2} - 4y_{n+1} + y_n = 2h f_{n+2} . \]

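12.19 When the general two-stage implicit Runge-Kutta method is
applied to the single constant-coefficient differential equation
$y' = \lambda y$, show that

\[ k_1 = [1 + \lambda h (a_{12} - a_{22})]\,\lambda y_n / \Delta , \]
\[ k_2 = [1 + \lambda h (a_{21} - a_{11})]\,\lambda y_n / \Delta , \]

where $\Delta$ is the determinant of the matrix $I - \lambda h A$ with

\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} . \]

For the method defined by the Butcher tableau

\[
\begin{array}{c|cc}
\frac{1}{6}(3-\sqrt{3}) & \frac{1}{4} & \frac{1}{12}(3-2\sqrt{3}) \\[2pt]
\frac{1}{6}(3+\sqrt{3}) & \frac{1}{12}(3+2\sqrt{3}) & \frac{1}{4} \\[2pt]
\hline
 & \frac{1}{2} & \frac{1}{2}
\end{array}
\]

deduce that $y_{n+1} = R(\lambda h)\, y_n$, where

\[ R(\lambda h) = \frac{1 + \frac{1}{2}\lambda h + \frac{1}{12}\lambda^2 h^2}{1 - \frac{1}{2}\lambda h + \frac{1}{12}\lambda^2 h^2} . \]

By writing $R(z)$ in the factorised form $(z+p)(z+q)/[(z-p)(z-q)]$,
deduce that this Runge-Kutta method is A-stable.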


13 


Boundary value problems for ODEs 


13.1 Introduction 


In the previous chapter we discussed numerical methods for initial value 
problems in which all the associated side conditions for a system of 
differential equations are prescribed at the same point. Now we go on to 
consider problems where these conditions specify values at more than one 
point. Typically we require the solution on an interval [a,b], and some 
conditions are given at a, and the rest at b, although more complicated 
situations are possible, involving three or more points. 

We shall begin with the simplest case, of a second-order equation 
with one condition given at a and one at b. This problem is sufficient 
to introduce the basic ideas, and is of a type which arises quite often in 
practice. 

We then go on to discuss the shooting method for the solution of more 
general problems. 


13.2 A model problem 


The simplest two-point boundary problem involves the second-order dif- 
ferential equation 


\[ -y'' + r(x)\,y = f(x), \qquad a < x < b, \tag{13.1} \]

with the boundary conditions

\[ y(a) = A, \qquad y(b) = B, \tag{13.2} \]


where A and B are given real numbers. We shall assume that r and f 
are given real-valued functions, defined and continuous on the bounded 
closed interval [a, b] of the real line, and that 


\[ r(x) \ge 0, \qquad a \le x \le b . \]

The reason for this condition will appear later, in Theorem 13.4.
We shall construct a numerical approximation to the solution on a 
uniform mesh of points 


\[ x_j = a + jh, \qquad j = 0,1,\ldots,n, \qquad h = (b-a)/n, \qquad n \ge 2, \]

so that $x_0 = a$, $x_n = b$. The second derivative is approximated using
the second central difference defined below.


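Definition 13.1 The central difference $\delta y$ of y is defined by

\[ \delta y(x_j) = y(x_j + \tfrac{1}{2} h) - y(x_j - \tfrac{1}{2} h) . \]

Higher-order differences are defined recursively by

\[ \delta^{m+1} y(x_j) = \delta[\delta^m y(x_j)] = \delta^m y(x_j + \tfrac{1}{2} h) - \delta^m y(x_j - \tfrac{1}{2} h) . \]

In particular, the second central difference may be written

\[ \delta^2 y(x_j) = \delta y(x_j + \tfrac{1}{2} h) - \delta y(x_j - \tfrac{1}{2} h)
                 = y(x_j + h) - 2y(x_j) + y(x_j - h) . \]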


Theorem 13.1 (i) Suppose that $y \in C^4[x-h, x+h]$, i.e., that y has
continuous fourth derivative on the interval $[x-h, x+h]$. Then, there
exists a number $\xi$ in $(x-h, x+h)$ such that

\[ \frac{\delta^2 y(x)}{h^2} = y''(x) + \tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi) . \]

(ii) Suppose that $y \in C^6[x-h, x+h]$; then, there exists a number $\eta$
in $(x-h, x+h)$ such that

\[ \frac{\delta^2 y(x)}{h^2} = y''(x) + \tfrac{1}{12} h^2 y^{\mathrm{iv}}(x) + \tfrac{1}{360} h^4 y^{\mathrm{vi}}(\eta) . \tag{13.3} \]


Proof (i) Taylor's Theorem shows that there exist numbers $\xi_1$ and $\xi_2$ in
the intervals $(x-h, x)$ and $(x, x+h)$, respectively, such that

\[
\begin{aligned}
y(x-h) &= y(x) - h y'(x) + \tfrac{1}{2} h^2 y''(x) - \tfrac{1}{6} h^3 y'''(x) + \tfrac{1}{24} h^4 y^{\mathrm{iv}}(\xi_1), \\
y(x+h) &= y(x) + h y'(x) + \tfrac{1}{2} h^2 y''(x) + \tfrac{1}{6} h^3 y'''(x) + \tfrac{1}{24} h^4 y^{\mathrm{iv}}(\xi_2).
\end{aligned}
\tag{13.4}
\]

Since $y^{\mathrm{iv}}$ is continuous on $[x-h, x+h]$, there is a number $\xi$ in $(\xi_1, \xi_2)$,
and thus also in $(x-h, x+h)$, such that

\[ \tfrac{1}{2} \left( y^{\mathrm{iv}}(\xi_1) + y^{\mathrm{iv}}(\xi_2) \right) = y^{\mathrm{iv}}(\xi) . \]

The required result is now obtained by adding the two equations (13.4)
and dividing by $h^2$.

(ii) The proof is completely analogous, and is left to the reader as an 
exercise. (See Exercise 1.) 
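
Theorem 13.1 can be illustrated with a short numerical experiment. The sketch below is an illustration under stated assumptions: the test function $y = \sin$, the point $x = 0.3$ and the helper name `second_central_difference` are choices made here, not taken from the text. It compares the error of $\delta^2 y(x)/h^2$ with the leading term $\frac{1}{12} h^2 y^{\mathrm{iv}}(x)$; by part (ii) the two should differ by at most $\frac{1}{360} h^4 \max |y^{\mathrm{vi}}|$, and here $|y^{\mathrm{vi}}| \le 1$.

```python
import math


def second_central_difference(y, x, h):
    """delta^2 y(x) = y(x + h) - 2 y(x) + y(x - h)."""
    return y(x + h) - 2.0 * y(x) + y(x - h)


x0 = 0.3
rows = []
for h in (0.1, 0.05, 0.025):
    approx = second_central_difference(math.sin, x0, h) / h**2
    error = approx - (-math.sin(x0))          # y'' = -sin for y = sin
    leading = h**2 / 12.0 * math.sin(x0)      # (1/12) h^2 y^{iv}(x), y^{iv} = sin
    rows.append((h, error, leading))
    print(f"h={h:<6} error={error:.3e} leading term={leading:.3e}")
```

Halving h should reduce the error by a factor close to 4, as expected for a second-order approximation.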


We can now use the central difference approximation to construct
the numerical solution. Writing $Y_j$ for the numerical approximation to
$y(x_j)$, we approximate the differential equation by

\[ -\frac{\delta^2 Y_j}{h^2} + r_j Y_j = f_j, \qquad j = 1,2,\ldots,n-1, \tag{13.5} \]

where we have used the notation $r_j = r(x_j)$, $f_j = f(x_j)$. Now, (13.5)
is a system of $n-1$ linear algebraic equations for the $n-1$ unknowns
$Y_j$, $j = 1,2,\ldots,n-1$, with the boundary conditions specifying the
values of $Y_0$ and $Y_n$,

\[ Y_0 = A, \qquad Y_n = B. \tag{13.6} \]


The system may be written in matrix form as

\[ MY = g, \]

where $Y, g \in \mathbb{R}^{n-1}$ and, for $n \ge 4$, the matrix $M \in \mathbb{R}^{(n-1)\times(n-1)}$ is
tridiagonal. Here $Y = (Y_1, \ldots, Y_{n-1})^{\mathrm T}$, the nonzero elements of M are

\[ M_{jj} = 2/h^2 + r_j, \qquad M_{j,j-1} = M_{j,j+1} = -1/h^2, \tag{13.7} \]

and the elements of the column vector g on the right-hand side are

\[ g_1 = f_1 + A/h^2, \qquad g_{n-1} = f_{n-1} + B/h^2, \qquad g_j = f_j, \quad j = 2,3,\ldots,n-2. \]


Note how the known boundary values $Y_0$ and $Y_n$ have been transferred to
the right-hand side, and appear in the first and last elements of g. The
solution of this system is very easy, using the algorithm for tridiagonal
matrices described in Section 3.3. Using the fact that $r(x) \ge 0$, we see
that the off-diagonal elements of M are negative, the diagonal elements
are positive, and in each row the diagonal element is at least as large 
as the sum of absolute values of the off-diagonal elements. Theorem 3.4 
implies that no row interchanges are needed in the calculation, and that 
the matrix M is nonsingular. The calculation is therefore very straight- 
forward and efficient, and requires very little computational time, even 
for a mesh which may contain several hundred points. 
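
The construction above can be sketched in a few lines of code. This is a minimal illustration, assuming NumPy; the function names `solve_tridiagonal` and `solve_model_bvp` are hypothetical, and the elimination routine is a standard tridiagonal (Thomas) solver without row interchanges, written here as a stand-in for the algorithm of Section 3.3. The test problem $-y'' + y = (\pi^2 + 1)\sin \pi x$ on $[0,1]$ with $y(0) = y(1) = 0$, whose solution is $y = \sin \pi x$, is also an assumption made for the example.

```python
import numpy as np


def solve_tridiagonal(lower, diag, upper, rhs):
    """Gaussian elimination for a tridiagonal system, no row interchanges.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    m = len(diag)
    d = np.array(diag, dtype=float)
    g = np.array(rhs, dtype=float)
    for i in range(1, m):                      # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        g[i] -= w * g[i - 1]
    x = np.empty(m)
    x[-1] = g[-1] / d[-1]
    for i in range(m - 2, -1, -1):             # back substitution
        x[i] = (g[i] - upper[i] * x[i + 1]) / d[i]
    return x


def solve_model_bvp(r, f, a, b, A, B, n):
    """Central difference solution of -y'' + r(x) y = f(x), y(a) = A, y(b) = B."""
    h = (b - a) / n
    x = a + h * np.arange(n + 1)
    xi = x[1:-1]                               # interior mesh points x_1..x_{n-1}
    diag = 2.0 / h**2 + np.asarray(r(xi), dtype=float)   # M_jj as in (13.7)
    off = -np.ones(n - 1) / h**2                         # M_{j,j-1} = M_{j,j+1}
    g = np.asarray(f(xi), dtype=float) + np.zeros(n - 1)
    g[0] += A / h**2                           # boundary values moved to the right-hand side
    g[-1] += B / h**2
    Y = np.empty(n + 1)
    Y[0], Y[-1] = A, B
    Y[1:-1] = solve_tridiagonal(off, diag, off, g)
    return x, Y


n = 50
x, Y = solve_model_bvp(lambda t: np.ones_like(t),
                       lambda t: (np.pi**2 + 1.0) * np.sin(np.pi * t),
                       0.0, 1.0, 0.0, 0.0, n)
err = np.max(np.abs(Y - np.sin(np.pi * x)))
print("max error:", err)
```

The cost of the elimination grows only linearly with n, which is why meshes with several hundred points present no difficulty.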


13.3 Error analysis


Having obtained the numerical solution we must now analyse its accu- 
racy. In the same way as for initial value problems, we begin by finding 
the truncation error. 


Definition 13.2 The truncation error of the central difference
approximation to the problem (13.1) is

\[ T_j = -\frac{\delta^2 y(x_j)}{h^2} + r_j\, y(x_j) - f_j, \qquad j = 1,2,\ldots,n-1, \]

where y is the exact solution of (13.1), (13.2).


Theorem 13.2 Suppose that the solution y to the boundary value
problem (13.1), (13.2) has a continuous fourth derivative on [a,b]. Then,
the truncation error may be written

\[ T_j = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_j), \tag{13.8} \]

for some value of $\xi_j$ in the interval $(x_{j-1}, x_{j+1})$, $j = 1,2,\ldots,n-1$. The
truncation error is bounded by T, where

\[ |T_j| \le T = \tfrac{1}{12} h^2 M_4, \qquad j = 1,2,\ldots,n-1, \]

and

\[ M_4 = \max_{x \in [a,b]} |y^{\mathrm{iv}}(x)| . \tag{13.9} \]


Proof The expression for $T_j$ follows from the substitution of the
expression for $\delta^2 y(x_j)$ given by Theorem 13.1 into the definition of $T_j$, and the
use of the fact that y is the solution of the differential equation. The
proof of the bound for $T_j$ is then immediate; since $y^{\mathrm{iv}}$ is known to be
continuous on [a,b] it is bounded on [a,b], so $M_4$ exists.


In order to simplify writing, we define

\[ L(u_j) = -\frac{\delta^2 u_j}{h^2} + r_j u_j, \qquad j = 1,2,\ldots,n-1, \]

for any set of real numbers $\{u_0, u_1, \ldots, u_n\}$. The global error in the
numerical solution is defined by

\[ e_j = y(x_j) - Y_j, \qquad j = 0,1,\ldots,n. \]

Now, $y(x_j)$ and $Y_j$ satisfy

\[ L(y(x_j)) = f_j + T_j, \qquad j = 1,2,\ldots,n-1, \]
\[ L(Y_j) = f_j, \qquad j = 1,2,\ldots,n-1, \]

from the definition of truncation error and (13.5); hence, by subtraction,

\[ L(e_j) = T_j, \qquad j = 1,2,\ldots,n-1, \]

with the boundary conditions $e_0 = e_n = 0$. We must now use the bound
on $T_j$ to derive a bound on the error $e_j$. This will be achieved by means
of the following theorem.


Theorem 13.3 (Maximum Principle) Let $a_j, b_j, c_j$, $j = 0,1,\ldots,n$,
be positive real numbers such that $b_j \ge a_j + c_j$, and suppose that $u_j$,
$j = 0,1,\ldots,n$, are real numbers such that

\[ -a_j u_{j-1} + b_j u_j - c_j u_{j+1} \le 0, \qquad j = 1,2,\ldots,n-1. \]

Then, $u_j \le K$, $j = 0,1,\ldots,n$, where $K = \max\{u_0, u_n, 0\}$.

Proof Let $u_r = \max\{u_0, u_1, \ldots, u_n\}$; then if $r = 0$, $r = n$, or $u_r \le 0$
the result is trivial. Suppose then that $1 \le r \le n-1$, and that $u_r > 0$.
Since $u_r$ is the maximum of the $u_j$, we know that

\[ u_r \ge u_{r-1}, \qquad u_r \ge u_{r+1}. \]

Hence

\[ b_r u_r \le a_r u_{r-1} + c_r u_{r+1} \le a_r u_r + c_r u_r \le b_r u_r , \]

since $u_r > 0$. This means that equality holds throughout, so that
$u_{r-1} = u_r = u_{r+1}$. We can then apply the same argument to both $u_{r-1}$ and
$u_{r+1}$, continuing until we find that either $u_r = u_n$, or $u_r = u_0$. Thus, in
this case $u_0 = u_n = \max\{u_0, u_1, \ldots, u_n\}$, as required.


Theorem 13.4 Suppose that the solution y of the boundary value
problem (13.1), (13.2) has a continuous fourth derivative on [a,b], and that
$Y_j$, $j = 0,1,\ldots,n$, is the solution of the central difference approximation
(13.5), (13.6). Then,

\[ \max_{0 \le j \le n} |y(x_j) - Y_j| \le \tfrac{1}{96} h^2 (b-a)^2 M_4 . \tag{13.10} \]

Proof Let $e_j = y(x_j) - Y_j$. We have already seen that $L(e_j) = T_j$,
$j = 1,2,\ldots,n-1$. Defining

\[ \varphi_j = C\{(2j-n)^2 h^2 - n^2 h^2\}, \qquad j = 0,1,\ldots,n, \tag{13.11} \]

where C is a constant, we see that

\[
\begin{aligned}
L(\varphi_j) &= -C\{(2j-2-n)^2 - 2(2j-n)^2 + (2j+2-n)^2\} + r_j \varphi_j \\
             &= -8C + r_j \varphi_j, \qquad j = 1,2,\ldots,n-1.
\end{aligned}
\]

Hence

\[ L(e_j + \varphi_j) = T_j - 8C + r_j \varphi_j, \qquad j = 1,2,\ldots,n-1. \]

If we choose $C = T/8$ with $T = \tfrac{1}{12} h^2 M_4$, we see that $L(e_j + \varphi_j) \le 0$,
since $|T_j| \le T$, $r_j \ge 0$ and $\varphi_j \le 0$, and L satisfies the conditions of the
Maximum Principle. Now,

\[ e_0 + \varphi_0 = e_n + \varphi_n = 0, \]

so that, according to Theorem 13.3, $e_j + \varphi_j \le 0$ for $j = 0,1,\ldots,n$.
However, $-Cn^2h^2 \le \varphi_j \le 0$, so we have the result

\[ e_j \le Cn^2h^2 = \tfrac{1}{8}(b-a)^2 T = \tfrac{1}{96} h^2 (b-a)^2 M_4, \qquad j = 0,1,\ldots,n. \]

By applying the same argument to $L(-e_j + \varphi_j)$ we find that

\[ -e_j \le \tfrac{1}{96} h^2 (b-a)^2 M_4, \qquad j = 0,1,\ldots,n. \]

Combining these upper bounds for $e_j$ and $-e_j$ gives the required result.


The function $\varphi$ defined by (13.11) is called a comparison function.
An alternative proof of Theorem 13.4, based on the properties of
monotone matrices, can be given by using the result in Exercise 2. Notice
that the condition $r(x) \ge 0$ is used in the application of the Maximum
Principle in the above proof.

This theorem shows that, provided the solution y has a continuous
fourth derivative, the numerical method is convergent, that is

\[ \max_{0 \le j \le n} |y(x_j) - Y_j| \to 0 \quad\text{as } n \to \infty \]

(or, equivalently, as $h = (b-a)/n \to 0$). This means that we can obtain
any required accuracy by choosing n sufficiently large.
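
The second-order convergence predicted by (13.10) can be observed in a small experiment. The sketch below is illustrative only: it assumes NumPy, uses a dense solver for brevity rather than the tridiagonal algorithm, and the test problem $-y'' + (1+x)y = x e^x$ on $[0,1]$ with $y(0) = 1$, $y(1) = e$ (exact solution $y = e^x$, so $M_4 = e$) and the helper name `fd_error` are choices made here.

```python
import numpy as np


def fd_error(n):
    """Return (h, max error) for the central difference scheme on the assumed test problem."""
    a, b = 0.0, 1.0
    h = (b - a) / n
    xi = a + h * np.arange(1, n)               # interior mesh points
    r = 1.0 + xi
    f = xi * np.exp(xi)
    M = (np.diag(2.0 / h**2 + r)
         - np.diag(np.ones(n - 2), 1) / h**2
         - np.diag(np.ones(n - 2), -1) / h**2)
    g = f.copy()
    g[0] += 1.0 / h**2                         # y(0) = 1
    g[-1] += np.e / h**2                       # y(1) = e
    Y = np.linalg.solve(M, g)
    return h, np.max(np.abs(np.exp(xi) - Y))


results = [fd_error(n) for n in (10, 20, 40, 80)]
for h, e in results:
    bound = h**2 * np.e / 96.0                 # right-hand side of (13.10) with b - a = 1
    print(f"h={h:.4f}  error={e:.3e}  bound={bound:.3e}")
```

Each halving of h should reduce the error by a factor of about 4, and the observed error should stay below the bound of Theorem 13.4.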


13.4 Boundary conditions involving a derivative


The same differential equation (13.1) may be associated with boundary
conditions involving the first derivative of the solution. Suppose, for
example, that we are given real numbers $\alpha > 0$, A and B. Consider the
differential equation (13.1) together with the boundary conditions

\[ y'(a) - \alpha y(a) = A, \qquad y(b) = B. \tag{13.12} \]

The condition at $x = a$ may be approximated in various ways; we shall
introduce an extra mesh point $x_{-1} = a - h$ outside the interval and use the
approximate version

\[ \frac{Y_1 - Y_{-1}}{2h} - \alpha Y_0 = A. \]

This gives

\[ Y_{-1} = Y_1 - 2h\alpha Y_0 - 2hA. \]

Writing the same central difference approximation (13.5) as before, but
now for $j = 0,1,\ldots,n-1$, we can eliminate the extra unknown $Y_{-1}$
from the equation at $j = 0$ to give

\[ \left[ \frac{2(1+\alpha h)}{h^2} + r_0 \right] Y_0 - \frac{2}{h^2} Y_1 = f_0 - \frac{2}{h} A . \]

Together with (13.5), for $j = 1,2,\ldots,n-1$, we now have a system of n
equations for the unknowns $Y_j$, $j = 0,1,\ldots,n-1$. There is one more
equation and one more unknown than before, but the new matrix is
still tridiagonal, and also diagonally dominant because of the condition
$\alpha > 0$. The computation is again very straightforward.
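
A minimal sketch of this scheme follows, assuming NumPy and a dense solver for brevity. The function name `solve_derivative_bc` and the test problem are assumptions made for illustration: $-y'' + y = 0$ on $[0,1]$ with $\alpha = 2$, $A = -1$ and $B = e$, for which $y = e^x$ satisfies $y'(0) - 2y(0) = -1$ and $y(1) = e$.

```python
import numpy as np


def solve_derivative_bc(r, f, a, b, alpha, A, B, n):
    """Central differences for -y'' + r y = f with y'(a) - alpha*y(a) = A, y(b) = B.

    The unknowns are Y_0, ..., Y_{n-1}; the ghost value Y_{-1} has been eliminated.
    """
    h = (b - a) / n
    x = a + h * np.arange(n + 1)
    xu = x[:-1]                                # mesh points carrying unknowns
    rv = np.asarray(r(xu), dtype=float) + np.zeros(n)
    g = np.asarray(f(xu), dtype=float) + np.zeros(n)
    M = np.zeros((n, n))
    for j in range(n):
        M[j, j] = 2.0 / h**2 + rv[j]
        if j > 0:
            M[j, j - 1] = -1.0 / h**2
        if j < n - 1:
            M[j, j + 1] = -1.0 / h**2
    # Modified first row obtained after eliminating Y_{-1}.
    M[0, 0] = 2.0 * (1.0 + alpha * h) / h**2 + rv[0]
    M[0, 1] = -2.0 / h**2
    g[0] -= 2.0 * A / h
    g[-1] += B / h**2                          # known value Y_n = B
    Y = np.empty(n + 1)
    Y[:-1] = np.linalg.solve(M, g)
    Y[-1] = B
    return x, Y


def max_error(n):
    x, Y = solve_derivative_bc(lambda t: np.ones_like(t), lambda t: np.zeros_like(t),
                               0.0, 1.0, 2.0, -1.0, np.e, n)
    return np.max(np.abs(Y - np.exp(x)))


err40, err80 = max_error(40), max_error(80)
print(err40, err80, err40 / err80)
```

Although the boundary equation is only first-order accurate locally, the computed error should still decrease by a factor of about 4 when h is halved.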


Theorem 13.5 Suppose that $y \in C^3[x-h, x+h]$; then, there exists a
real number $\chi$ in $(x-h, x+h)$ such that

\[ \frac{y(x+h) - y(x-h)}{2h} = y'(x) + \tfrac{1}{6} h^2 y'''(\chi) . \tag{13.13} \]

Proof Taylor's Theorem shows that there exist $\chi_1 \in (x-h, x)$ and
$\chi_2 \in (x, x+h)$ such that

\[
\begin{aligned}
y(x-h) &= y(x) - h y'(x) + \tfrac{1}{2} h^2 y''(x) - \tfrac{1}{6} h^3 y'''(\chi_1), \\
y(x+h) &= y(x) + h y'(x) + \tfrac{1}{2} h^2 y''(x) + \tfrac{1}{6} h^3 y'''(\chi_2).
\end{aligned}
\]

We subtract the first equality from the second, and the result follows as
in the proof of Theorem 13.1.

Note that the approximation to $y'(x)$ at $x = x_0$ may be written

\[ \frac{\tfrac{1}{2} \left[ \delta y(x_0 + \tfrac{1}{2} h) + \delta y(x_0 - \tfrac{1}{2} h) \right]}{h} . \]

For $j = 1,2,\ldots,n-1$, we define the truncation error $T_j$ as in
Definition 13.2. In addition, since we shall now also incur an error in the
approximation of the boundary condition at $x = a$, we define

\[ T_0 = \left[ \frac{2(1+\alpha h)}{h^2} + r_0 \right] y(x_0) - \frac{2}{h^2}\, y(x_1) - f_0 + \frac{2}{h} A . \]

The aim of our next result is to quantify the size of the truncation error
in terms of the mesh size h.


Theorem 13.6 Suppose that the solution y to the boundary value
problem (13.1), (13.12) has a continuous fourth derivative on the closed
interval $[a-h, b]$. Then, the truncation error of the central difference
approximation to (13.1) with boundary conditions (13.12) may be written

\[ T_j = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_j), \qquad j = 1,2,\ldots,n-1, \]

\[ T_0 = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_0) - \tfrac{1}{3} h\, y'''(\chi), \]

for some value of $\xi_j$ in the interval $(x_{j-1}, x_{j+1})$, $0 \le j \le n-1$, and
some value $\chi$ in the interval $(x_{-1}, x_1)$, where $x_{-1} = a - h$.


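Proof For $j = 1,2,\ldots,n-1$, this is the same result as in Theorem 13.2.
When $j = 0$, we find that

\[
\begin{aligned}
T_0 &= \left[ \frac{2(1+\alpha h)}{h^2} + r_0 \right] y(x_0) - \frac{2}{h^2}\, y(x_1) - f_0 + \frac{2}{h} A \\
    &= -\frac{y(x_1) - 2y(x_0) + y(x_{-1})}{h^2} + r_0\, y(x_0) - f_0
       - \frac{2}{h} \left[ \frac{y(x_1) - y(x_{-1})}{2h} - \alpha y(x_0) - A \right] \\
    &= -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_0) - \tfrac{1}{3} h\, y'''(\chi),
\end{aligned}
\]

where we have used Theorem 13.1, Theorem 13.5, the differential equation
(13.1) and the boundary condition (13.12) at $x = a$.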


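Theorem 13.7 Suppose that the solution y of (13.1) with the boundary
conditions (13.12) has a continuous fourth derivative on the interval
$[a-h, b]$; then, the numerical solution obtained from the central difference
approximation satisfies

\[ \max_{0 \le j \le n} |y(x_j) - Y_j| \le h^2 \left\{ \tfrac{1}{24} (b-a)^2 M_4 + \tfrac{1}{6} (b-a) M_3 \right\} , \]

where $M_3$ and $M_4$ denote the maxima of $|y'''(x)|$ and $|y^{\mathrm{iv}}(x)|$ over $[a-h, b]$.

Proof The proof is very similar to that of Theorem 13.4, but requires
the use of a more complicated comparison function $\varphi_j$. Let us define

\[ L^*(u_j) = -\frac{\delta^2 u_j}{h^2} + r_j u_j, \qquad j = 1,2,\ldots,n-1, \]
\[ L^*(u_0) = \left[ \frac{2(1+\alpha h)}{h^2} + r_0 \right] u_0 - \frac{2}{h^2}\, u_1 , \]

for any set of real numbers $\{u_0, u_1, \ldots, u_n\}$, and let

\[ \varphi_j = C j^2 h^2 + D j h + E, \qquad j = 0,1,\ldots,n, \]

where C, D and E are constants to be determined. Then, with $e_j =
y(x_j) - Y_j$, as in the proof of Theorem 13.4, we see that

\[ L^*(e_j) = T_j, \qquad j = 0,1,\ldots,n-1. \]

A simple calculation shows that

\[ L^*(\varphi_j) = -2C + r_j \varphi_j, \qquad j = 1,2,\ldots,n-1, \]
\[ L^*(\varphi_0) = -2C - 2D/h + [2\alpha/h + r_0]\,E . \]

Hence

\[ L^*(e_j + \varphi_j) = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_j) - 2C + r_j \varphi_j, \qquad j = 1,2,\ldots,n-1, \]
\[ L^*(e_0 + \varphi_0) = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_0) - \tfrac{1}{3} h\, y'''(\chi) - 2C - 2D/h + [2\alpha/h + r_0]\,E . \]

If we now choose

\[ C = \tfrac{1}{24} h^2 M_4, \qquad D = \tfrac{1}{6} h^2 M_3, \qquad E = -C(b-a)^2 - D(b-a), \]

it is easy to check that

\[ \varphi_j \le 0, \qquad j = 0,1,\ldots,n, \]
\[ L^*(e_j + \varphi_j) \le 0, \qquad j = 0,1,\ldots,n-1. \]

The Maximum Principle then applies, and we deduce that

\[ e_j + \varphi_j \le \max\{e_0 + \varphi_0,\; e_n + \varphi_n,\; 0\}, \qquad j = 0,1,\ldots,n. \]

We see at once that $e_n = \varphi_n = 0$ and $\varphi_0 \le 0$, but in this case $e_0$ is not
zero. Therefore, all we can conclude for the moment is that

\[ e_j + \varphi_j \le \max\{e_0 + \varphi_0,\; 0\}, \qquad j = 0,1,\ldots,n. \tag{13.14} \]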


In particular,

\[ e_1 + \varphi_1 \le \max\{e_0 + \varphi_0,\; 0\}. \tag{13.15} \]

However, $L^*(e_0 + \varphi_0) \le 0$; thus, by the definition of $L^*(e_0 + \varphi_0)$,

\[ e_0 + \varphi_0 \le \frac{2}{2(1+\alpha h) + h^2 r_0}\,(e_1 + \varphi_1) . \]

On writing $\delta = 2/(2(1+\alpha h) + h^2 r_0)$ and noting that, since $\alpha > 0$ and
$r_0 \ge 0$, we have $0 < \delta < 1$, it follows that

\[ e_0 + \varphi_0 \le \delta\,(e_1 + \varphi_1). \tag{13.16} \]

Inserting (13.15) into the right-hand side of this inequality, we find that

\[ e_0 + \varphi_0 \le \max\{\delta (e_0 + \varphi_0),\; 0\}. \]

If $e_0 + \varphi_0$ were positive, this inequality and the fact that $0 < \delta < 1$ would
imply $e_0 + \varphi_0 \le 0$, leading to a contradiction. Therefore, $e_0 + \varphi_0 \le 0$.
Returning with this information to (13.14), we conclude that $e_j + \varphi_j \le 0$
for $j = 0,1,\ldots,n$, and the rest of the proof then follows as in the proof
of Theorem 13.4.


13.5 The general self-adjoint problem 


The general self-adjoint boundary value problem is

\[ -\frac{d}{dx}\left( p(x) \frac{dy}{dx} \right) + r(x)\,y = f(x), \qquad a < x < b, \tag{13.17} \]

where r and f are real-valued functions, defined and continuous on [a,b],
p is a real-valued continuously differentiable function on [a,b], $r(x) \ge 0$
and $p(x) \ge c_0 > 0$. We shall consider only the case where the boundary
conditions prescribe the values of y at each end,

\[ y(a) = A, \qquad y(b) = B. \tag{13.18} \]

The central difference approximation to the equation (13.17) may be
written

\[ -\frac{\delta(p_j\, \delta Y_j)}{h^2} + r_j Y_j = f_j, \qquad j = 1,2,\ldots,n-1, \]

or, in detail,

\[ -\frac{p_{j+1/2}(Y_{j+1} - Y_j) - p_{j-1/2}(Y_j - Y_{j-1})}{h^2} + r_j Y_j = f_j, \tag{13.19} \]

for $j = 1,2,\ldots,n-1$, and is supplemented by the boundary conditions

\[ Y_0 = A, \qquad Y_n = B. \tag{13.20} \]


It is easy to see that this represents a system of linear equations for
the unknowns $Y_1, Y_2, \ldots, Y_{n-1}$, and that the matrix of the system is
tridiagonal and diagonally dominant, just as it was in the special case
(13.1), which corresponds to $p(x) \equiv 1$. The solution of the system is
therefore a very simple matter.

Next, we consider the error analysis of the difference scheme (13.19),
(13.20). We begin by quantifying the size of the truncation error

\[ T_j = -\frac{\delta(p_j\, \delta y(x_j))}{h^2} + r_j\, y(x_j) - f_j, \qquad j = 1,2,\ldots,n-1, \]

in terms of the mesh size h.


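Lemma 13.1 Suppose that $p \in C^3[a,b]$ and $y \in C^4[a,b]$. The truncation
error $T_j$ of the central difference approximation (13.19) then satisfies

\[ |T_j| \le T = \tfrac{1}{24} h^2 \max_{x \in [a,b]} \left\{ |(py')'''(x)| + |p' y'''(x)| + 2\,|p\, y^{\mathrm{iv}}(x)| \right\} , \]

for $j = 1,2,\ldots,n-1$.

Proof By expanding in Taylor series as we have done before, we find
that

\[
\begin{aligned}
p_{j+1/2}\,[y(x_{j+1}) - y(x_j)] &= p_{j+1/2}\,[h\, y'(x_{j+1/2}) + \tfrac{1}{24} h^3 y'''(\xi_1)], \\
p_{j-1/2}\,[y(x_j) - y(x_{j-1})] &= p_{j-1/2}\,[h\, y'(x_{j-1/2}) + \tfrac{1}{24} h^3 y'''(\xi_2)],
\end{aligned}
\]

where $\xi_1 \in (x_j, x_{j+1})$ and $\xi_2 \in (x_{j-1}, x_j)$. The first term in the difference
of these expressions gives, in the same way,

\[ h\,[p_{j+1/2}\, y'(x_{j+1/2}) - p_{j-1/2}\, y'(x_{j-1/2})] = h\,[h (py')'(x_j) + \tfrac{1}{24} h^3 (py')'''(\xi_3)] , \]

where $\xi_3 \in (x_{j-1/2}, x_{j+1/2})$. For the other term we can write

\[
\begin{aligned}
\tfrac{1}{24} h^3 \left| p_{j+1/2}\, y'''(\xi_1) - p_{j-1/2}\, y'''(\xi_2) \right|
 &= \tfrac{1}{24} h^3 \left| (p_{j+1/2} - p_{j-1/2})\, y'''(\xi_1) + p_{j-1/2}\,(y'''(\xi_1) - y'''(\xi_2)) \right| \\
 &\le \tfrac{1}{24} h^3 \left\{ |h\, p'(\xi_4)\, y'''(\xi_1)| + |p_{j-1/2}\, 2h\, y^{\mathrm{iv}}(\xi_5)| \right\} ,
\end{aligned}
\]

since $|\xi_1 - \xi_2| \le 2h$. Here, $\xi_4 \in (x_{j-1/2}, x_{j+1/2})$ and $\xi_5$ lies between $\xi_1$
and $\xi_2$. The required bound follows immediately.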

As in the proof of Theorem 13.4, we can now derive a bound on the
global error in the numerical solution in terms of the truncation error 
by using the Maximum Principle. The only difficulty in extending that 
theorem to the more general self-adjoint problem lies in the construction 
of a comparison function corresponding to (13.11). The general case 
requires some detailed analysis, which can be simplified under certain 
conditions on the function p, for example if p is monotonic. 


Lemma 13.2 Suppose that p and r are continuous functions defined on
[a,b], p is monotonic increasing on [a,b], $p(x) \ge c_0 > 0$, $r(x) \ge 0$, and
define

\[ L(u_j) = -\frac{\delta(p_j\, \delta u_j)}{h^2} + r_j u_j, \qquad j = 1,2,\ldots,n-1, \]

for any set of real numbers $\{u_0, u_1, \ldots, u_n\}$. Further, let

\[ \varphi_j = C(j^2 - n^2) h^2, \qquad j = 0,1,\ldots,n, \]

where C is a positive constant. Then,

\[ L(\varphi_j) \le -2 c_0 C, \qquad j = 1,2,\ldots,n-1. \]

Proof It follows from the definition that

\[
\begin{aligned}
L(\varphi_j) &= -p_{j+1/2}\, C(2j+1) + p_{j-1/2}\, C(2j-1) + C(j^2 - n^2) h^2 r_j \\
             &= -C\left[ (p_{j+1/2} + p_{j-1/2}) + 2j\,(p_{j+1/2} - p_{j-1/2}) + h^2 (n^2 - j^2)\, r_j \right] \\
             &\le -2 c_0 C,
\end{aligned}
\]

for $j = 1,2,\ldots,n-1$, as required.


Note that we have imposed various conditions on the problem, which
are usually necessary, though some can be slightly relaxed. The
condition in this lemma, that p should be monotonic increasing on [a,b], is
only needed to simplify the subsequent proof. The main result is true
much more generally. We leave it as an exercise to derive the same result
under the assumption that p is monotonic decreasing on [a,b].


Theorem 13.8 Suppose that p and r are continuous functions defined
on [a,b], p is monotonic increasing on [a,b], $p(x) \ge c_0 > 0$, $r(x) \ge 0$.
Assume further that the solution y of (13.17), (13.18) has a continuous
fourth derivative on [a,b], that p has a continuous third derivative, and
that $Y_j$, $j = 0,1,\ldots,n$, is the solution of the central difference
approximation (13.19), (13.20). Then, with T as in Lemma 13.1,

\[ \max_{0 \le j \le n} |y(x_j) - Y_j| \le \frac{1}{2c_0}\,(b-a)^2\, T . \tag{13.21} \]

Proof The proof of this theorem follows that of Theorem 13.4, using the
bound from Lemma 13.1 on the truncation error and the comparison
function $\varphi_j$ from Lemma 13.2. The details are left as an exercise.


13.6 The Sturm–Liouville eigenvalue problem


Suppose that r is a real-valued function, defined and continuous on the
closed interval [a,b], p is a real-valued function, defined and continuously
differentiable on [a,b], and $r(x) \ge 0$, $p(x) \ge c_0 > 0$ for all $x \in [a,b]$. The
differential equation

\[ -\frac{d}{dx}\left( p(x) \frac{dy}{dx} \right) + r(x)\,y = \lambda y, \qquad a < x < b, \tag{13.22} \]

with homogeneous boundary conditions $y(a) = y(b) = 0$, has only the
trivial solution $y \equiv 0$, except for an infinite sequence of positive
eigenvalues $\lambda = \lambda_m$, $m = 1,2,\ldots$. We shall now consider a numerical method for
finding these eigenvalues and the corresponding eigenfunctions, $y_{(m)}(x)$,
$m = 1,2,\ldots$.

In the simple case where $p(x) \equiv 1$ and $r(x) \equiv 0$ the solution to
this problem is, of course, $\lambda_m = [m\pi/(b-a)]^2$, $y_{(m)}(x) = A \sin m\pi t$,
$m = 1,2,\ldots$, where A is a nonzero constant and $t = (x-a)/(b-a)$.


Using the same finite difference approximation as in the previous
section, we obtain the equations

\[ -\frac{p_{j+1/2}(Y_{j+1} - Y_j) - p_{j-1/2}(Y_j - Y_{j-1})}{h^2} + r_j Y_j = \Lambda Y_j, \qquad j = 1,2,\ldots,n-1. \]

Together with the boundary conditions $Y_0 = Y_n = 0$, this shows that $\Lambda$
is an eigenvalue of a symmetric tridiagonal matrix M whose entries are

\[ M_{jj} = \frac{p_{j+1/2} + p_{j-1/2}}{h^2} + r_j, \qquad 1 \le j \le n-1, \]
\[ M_{j,j-1} = -\frac{p_{j-1/2}}{h^2}, \quad 2 \le j \le n-1, \qquad
   M_{j,j+1} = -\frac{p_{j+1/2}}{h^2}, \quad 1 \le j \le n-2, \]

and the approximate function values $Y_j$ are the elements of the
corresponding eigenvector. This algebraic eigenvalue problem is easily solved
by the method described in Chapter 5.

The boundary value problems which we have discussed so far have all 
had a unique solution. The eigenvalue problem (13.22) has an infinite 
number of solutions, and the mesh used in the numerical computation 
has to be chosen to adequately represent the eigenfunctions required — 
the computation can obviously only find a finite number of them. The 
matrix M has n — 1 eigenvalues and eigenvectors and, as we shall see, 
it will normally give a good approximation to the first few eigenvalues,
$\lambda_1, \lambda_2, \ldots$, and a much less accurate approximation to $\lambda_{n-1}$.

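To analyse the error in the eigenvalue we proceed as before, by defining
the truncation error

\[ T_j = -\frac{p_{j+1/2}(y_{j+1} - y_j) - p_{j-1/2}(y_j - y_{j-1})}{h^2} + r_j y_j - \lambda y_j, \qquad j = 1, 2, \ldots, n-1, \]

where $y_j = y(x_j)$. These equations can now be written

\[ (M - \Lambda I)\mathbf{Y} = \mathbf{0}, \qquad (M - \lambda I)\mathbf{y} = \mathbf{T}, \]

where

\[ \mathbf{Y} = (Y_1, \ldots, Y_{n-1})^{\mathrm T}, \qquad \mathbf{y} = (y_1, \ldots, y_{n-1})^{\mathrm T}, \qquad \mathbf{T} = (T_1, \ldots, T_{n-1})^{\mathrm T}. \]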

Theorem 5.15 of Chapter 5 applies to this problem, and shows that one
of the eigenvalues, $\Lambda_m$, of the matrix $M$ satisfies

\[ |\lambda_m - \Lambda_m| \le \|\mathbf{T}\|_2 / \|\mathbf{y}\|_2. \tag{13.23} \]

In the simpler case where $p(x) \equiv 1$ and $r(x) \equiv 0$ the truncation error is

\[ T_j = -\tfrac{1}{12} h^2 y^{\mathrm{iv}}(\xi_j), \qquad \xi_j \in (x_{j-1}, x_{j+1}), \]

so the numerical method has evaluated the eigenvalue with error less
than

\[ \tfrac{1}{12} h^2 \Bigl( \sum_{j=1}^{n-1} |y^{\mathrm{iv}}(\xi_j)|^2 \Bigr)^{1/2} \Bigl( \sum_{j=1}^{n-1} |y(x_j)|^2 \Bigr)^{-1/2}. \]

Since the $m$th eigenfunction $y_{(m)}$ is given by

\[ y(x) = y_{(m)}(x) = \sin(m\pi(x-a)/(b-a)), \qquad x \in (a,b), \]

we see that

\[ y^{\mathrm{iv}}(x) = \left( \frac{m\pi}{b-a} \right)^4 y(x), \qquad x \in (a,b). \]

This shows that, for example, the error in the tenth eigenvalue,
corresponding to $m = 10$, is likely to be about $10^4$ times larger than the error
in the first eigenvalue; more generally, to evaluate higher eigenvalues of
the equation will require the use of a smaller interval $h$.
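
The growth of the error with $m$ can be checked numerically in the simple case $p \equiv 1$, $r \equiv 0$ on $[0,1]$. The sketch below assumes the standard closed form $\Lambda_m = (4/h^2)\sin^2(m\pi h/2)$ for the eigenvalues of the difference scheme in this special case (the expression requested in Exercise 13.9); the variable names are illustrative.

```python
import numpy as np

n = 100
h = 1.0 / n
m = np.arange(1, n)                                   # m = 1, ..., n-1

lam = (m * np.pi) ** 2                                # exact eigenvalues lambda_m
Lam = (4.0 / h**2) * np.sin(0.5 * m * np.pi * h) ** 2 # eigenvalues of the scheme
err = np.abs(lam - Lam)

bound = (m * np.pi) ** 4 * h**2 / 12.0                # m^4 pi^4 h^2 / 12
ratio = err[9] / err[0]                               # error for m = 10 relative to m = 1
```

Here `ratio` is close to $10^4$, and `err` stays below `bound` for every $m$, in line with the discussion above.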


13.7 The shooting method 


The methods we have described for the linear boundary value problem
may be extended to nonlinear differential equations. We shall not discuss
how this is done; instead, we shall describe an alternative approach,
called the shooting method. We shall consider the nonlinear model
problem

\[ y'' = f(x,y), \qquad a < x < b, \qquad y(a) = A, \quad y(b) = B, \]

where we assume that the function $f(x,y)$ is continuous and
differentiable, and that

\[ \frac{\partial f}{\partial y}(x,y) \ge 0, \qquad a < x < b, \quad y \in \mathbb{R}. \]

The central idea of the method is to replace the boundary value
problem under consideration by an initial value problem of the form

\[ y'' = f(x,y), \qquad a < x < b, \qquad y(a) = A, \quad y'(a) = t, \]

where $t$ is to be chosen in such a way that $y(b) = B$. This can be thought
of as a problem of trying to determine the angle of inclination $\tan^{-1} t$
of a loaded gun, so that, when shot from height $A$ at the point $x = a$,
the bullet hits the target placed at height $B$ at the point $x = b$. Hence
the name, shooting method.

Once the boundary value problem has been transformed into such an
'equivalent' initial value problem, any of the methods for the numerical
solution of initial value problems discussed in Chapter 12 can be applied
to find a numerical solution. Thus, in particular, the costly exercise of
solving a large system of nonlinear equations, arising from a direct finite
difference approximation of the nonlinear boundary value problem, can
be completely avoided.

If we write

\[ y'(a) = t, \]

a numerical solution of the differential equation with the initial
conditions $y(a) = A$, $y'(a) = t$ can be obtained by any of the methods of
Chapter 12. This solution will depend on $t$, and we may write it as
$y(x;t)$. In particular the value at $x = b$ will be a function of $t$,

\[ y(b;t) = \psi(t). \tag{13.24} \]

The solution of the nonlinear boundary value problem therefore reduces
to the determination of the value of $t$ for which the boundary condition
at $x = b$ is also satisfied, i.e.,

\[ \psi(t) - B = 0. \]


There are a number of well-known methods for the solution of equations
of this form; Newton's method is an obvious example. We shall not,
of course, have a closed-form expression for the function $\psi(t)$ in
general, but this is not necessary; all that is needed is a numerical
algorithm to calculate the value of $\psi(t)$ for a given value of $t$, and this we
have. To use Newton's method we shall also need to be able to calculate
the value of $\psi'(t)$, and this is easily done.

The function $y(x;t)$ is defined, for all $t$, as the solution of the initial
value problem

\[ y''(x;t) = f(x, y(x;t)), \qquad y(a;t) = A, \quad y'(a;t) = t, \tag{13.25} \]

where $'$ and $''$ indicate differentiation with respect to the variable $x$.
We can differentiate these throughout with respect to $t$, giving

\[ \frac{\partial}{\partial t}\, y''(x;t) = \frac{\partial f}{\partial y}(x, y(x;t))\, \frac{\partial y}{\partial t}(x;t), \qquad \frac{\partial y}{\partial t}(a;t) = 0, \quad \frac{\partial y'}{\partial t}(a;t) = 1. \]

Writing

\[ w(x;t) = \frac{\partial y}{\partial t}(x;t), \]

and interchanging the order of differentiation, we find that $w(x;t)$ may
be obtained as the solution of the initial value problem

\[ w''(x;t) = w(x;t)\, \frac{\partial f}{\partial y}(x, y(x;t)), \qquad w(a;t) = 0, \quad w'(a;t) = 1. \tag{13.26} \]

By virtue of (13.24), the required derivative is then given by

\[ \psi'(t) = w(b;t). \]


To implement this method, it is convenient to solve the two initial
value problems, (13.25) and (13.26), in tandem, by writing them as a
system of four simultaneous first-order differential equations:

\[ \begin{aligned}
u_1'(x;t) &= u_2(x;t), \\
u_2'(x;t) &= f(x, u_1(x;t)), \\
u_3'(x;t) &= u_4(x;t), \\
u_4'(x;t) &= u_3(x;t)\, \frac{\partial f}{\partial y}(x, u_1(x;t)),
\end{aligned} \tag{13.27} \]

with the initial conditions

\[ u_1(a;t) = A, \quad u_2(a;t) = t, \quad u_3(a;t) = 0, \quad u_4(a;t) = 1, \]

where $u_1(x;t)$ denotes $y(x;t)$, $u_3(x;t)$ signifies $w(x;t)$, and $u_2$ and $u_4$
are defined by $u_2 = u_1' = y'$ and $u_4 = u_3' = w'$.

Having obtained a numerical solution of this system of differential
equations for some chosen value of $t$, $t^{(k)}$ say, Newton's method gives,
as the next, improved, value for $t$,

\[ t^{(k+1)} = t^{(k)} - \frac{\psi(t^{(k)}) - B}{\psi'(t^{(k)})} = t^{(k)} - \frac{u_1(b; t^{(k)}) - B}{u_3(b; t^{(k)})}, \qquad k = 0, 1, 2, \ldots, \]

iterating until a certain number of decimal digits have converged.
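
The iteration above can be sketched as follows in Python with NumPy. Several choices here are assumptions made purely for illustration: the classical fourth-order Runge-Kutta integrator (any method of Chapter 12 could be used instead), the function name `shoot`, the stopping tolerance, and the test problem $y'' = y^3$ on $[0,1]$, for which $\partial f/\partial y = 3y^2 \ge 0$ and $y(x) = \sqrt{2}/(2-x)$ is an exact solution.

```python
import numpy as np

def shoot(f, dfdy, a, b, A, B, t0, n=200, tol=1e-10, max_iter=50):
    """Newton's method for psi(t) - B = 0, integrating the system (13.27)
    with classical RK4 on n equal steps (illustrative sketch)."""
    h = (b - a) / n

    def rhs(x, u):
        # u = (u1, u2, u3, u4) = (y, y', w, w')
        return np.array([u[1], f(x, u[0]), u[3], u[2] * dfdy(x, u[0])])

    def integrate(t):
        u = np.array([A, t, 0.0, 1.0])      # initial conditions at x = a
        x = a
        for _ in range(n):
            k1 = rhs(x, u)
            k2 = rhs(x + 0.5 * h, u + 0.5 * h * k1)
            k3 = rhs(x + 0.5 * h, u + 0.5 * h * k2)
            k4 = rhs(x + h, u + h * k3)
            u = u + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
            x += h
        return u                            # approximates u_i(b; t), i = 1, ..., 4

    t = t0
    for _ in range(max_iter):
        u = integrate(t)
        residual = u[0] - B                 # psi(t) - B
        if abs(residual) < tol:
            break
        t -= residual / u[2]                # psi'(t) = u_3(b; t)
    return t

s2 = np.sqrt(2.0)
t_star = shoot(lambda x, y: y**3, lambda x, y: 3.0 * y**2,
               a=0.0, b=1.0, A=s2 / 2.0, B=s2, t0=0.0)
```

For this test problem the exact initial slope is $y'(0) = \sqrt{2}/4$, which gives a value to compare `t_star` against.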


Theorem 13.9 Suppose that a numerical algorithm for the solution of
the system of differential equations (13.27) gives the result $v_{i,j}(t)$, the
numerical approximation to $u_i(x_j;t)$, $i = 1,2,3,4$, $j = 1,2,\ldots,n$, where
the error satisfies

\[ \max_{1 \le j \le n} |u_i(x_j;t) - v_{i,j}(t)| \le C(t)\, h^s \]

for some $s > 0$; here $C(t)$ depends on bounds on the derivatives of $y$ and
$f(x,y)$, and on $t$. Suppose also that the Newton iteration is performed
until

\[ |v_{1,n}(t^{(k)}) - B| \le \varepsilon. \]

Then, $v_{1,j}(t^{(k)})$ is an approximation to the solution of the boundary value
problem which satisfies

\[ \max_{1 \le j \le n} |y(x_j) - v_{1,j}(t^{(k)})| \le 2C(t^{(k)})\, h^s + \varepsilon. \]

Proof Suppose that the solution of the system of differential equations
with $t = t^{(k)}$ is $u_i(x;t)$, $i = 1,2,3,4$, and the corresponding numerical
solution is $v_{i,j}(t)$, $i = 1,2,3,4$, $j = 1,2,\ldots,n$; then

\[ |u_1(x_j;t) - v_{1,j}(t)| \le C(t)\, h^s, \qquad j = 1,2,\ldots,n. \]

Moreover $|v_{1,n}(t) - B| \le \varepsilon$, so that

\[ \begin{aligned}
|u_1(b;t) - B| &\le |u_1(b;t) - v_{1,n}(t)| + |v_{1,n}(t) - B| \\
&\le C(t)\, h^s + \varepsilon.
\end{aligned} \tag{13.28} \]

Let us write $\eta(x;t) = y(x) - u_1(x;t)$; by subtraction we see that

\[ \begin{aligned}
\eta''(x;t) &= y''(x) - u_1''(x;t) \\
&= f(x, y(x)) - f(x, u_1(x;t)) \\
&= \eta(x;t)\, \frac{\partial f}{\partial y}(x, \xi(x;t)),
\end{aligned} \]

where $\xi(x;t)$ lies between $u_1(x;t)$ and $y(x)$.

Suppose that $\eta'(a;t) > 0$; since $\eta(a;t) = 0$, there is some interval to
the right of $a$ in which $\eta(x;t) > 0$. Then, either $\eta(x;t) > 0$ for the whole
of $(a,b]$, or there is a value $c$ such that $a < c \le b$ and $\eta(c;t) = 0$. In
the latter case, $\eta'(x;t)$ must vanish at some point $x = d$ between $a$ and
$c$. However, in the interval $[a,d]$, $\eta(x;t) \ge 0$ and $\partial f/\partial y \ge 0$, so that
$\eta''(x;t) \ge 0$. Consequently, in the interval $[a,d]$, $\eta'(x;t) \ge \eta'(a;t) > 0$,
and we have a contradiction. Thus, $\eta(x;t) > 0$ for all $a < x \le b$. It then
follows that $\eta''(x;t) \ge 0$, and hence also $\eta'(x;t) > 0$, on the whole
interval $[a,b]$, which means that $x \mapsto \eta(x;t)$ is monotonic increasing
on $[a,b]$. If we had begun with the assumption that $\eta'(a;t) < 0$ an
analogous argument shows that $x \mapsto \eta(x;t)$ would have been monotonic
decreasing on $[a,b]$. It is left to the reader to discuss the trivial case
when $\eta'(a;t) = 0$.

In any case,

\[ |\eta(x;t)| \le |\eta(b;t)|, \qquad a \le x \le b, \]

and therefore, since $y(b) = B$ and recalling (13.28),

\[ |y(x) - u_1(x;t)| \le |B - u_1(b;t)| \le C(t)\, h^s + \varepsilon. \]

Thus, finally,

\[ \begin{aligned}
|y(x_j) - v_{1,j}(t)| &\le |y(x_j) - u_1(x_j;t)| + |u_1(x_j;t) - v_{1,j}(t)| \\
&\le C(t)\, h^s + \varepsilon + C(t)\, h^s, \qquad j = 1,2,\ldots,n,
\end{aligned} \]

and hence the desired bound.

Fig. 13.1. The function $t \mapsto \psi(t)$.


The shooting method is an example of a technique which can be
applied to much more general problems, including systems of differential
equations of any order, with some boundary conditions specified at each
end of the interval. The condition $\partial f/\partial y \ge 0$ is restrictive, and may
often not be satisfied in practical problems. Note that if $f(x,y)$ is linear
in $y$, of the form $f(x,y) = r(x)y + g(x)$, this condition is the same as the
condition $r(x) \ge 0$ imposed on the model problem in (13.2). Perhaps
the simplest example of a nonlinear two-point boundary value problem
is

\[ y'' = y^2, \qquad y(-1) = y(1) = 1, \tag{13.29} \]

where $\partial f/\partial y = 2y$, which does not satisfy the condition $\partial f/\partial y \ge 0$,
$y \in \mathbb{R}$. In fact, problem (13.29) has two solutions, one of which is
positive, and the other takes negative values around $x = 0$.

Figure 13.1 shows a graph of the corresponding function $t \mapsto \psi(t)$
defined in (13.24), over the range $-12 \le t \le 0$; outside this range
the function $\psi$ tends quite rapidly to $+\infty$. This shows clearly the two
solutions to the boundary value problem, given by the two values of $t$ at
which $\psi(t) = 1$. The two solutions are displayed in Figure 13.2.
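
The two roots of $\psi(t) = 1$ for problem (13.29) can be located with a short Python sketch. The choices below are assumptions for illustration only: classical RK4 with 400 steps as the integrator, a coarse scan of $[-12, 0]$ for sign changes, and bisection (rather than Newton's method) to refine each bracket; the names `psi` and `bisect` are hypothetical.

```python
import numpy as np

def psi(t, n=400):
    """RK4 approximation to y(1; t) for y'' = y^2, y(-1) = 1, y'(-1) = t."""
    h = 2.0 / n
    u = np.array([1.0, t])                       # (y, y') at x = -1
    rhs = lambda u: np.array([u[1], u[0] ** 2])  # autonomous system
    for _ in range(n):
        k1 = rhs(u)
        k2 = rhs(u + 0.5 * h * k1)
        k3 = rhs(u + 0.5 * h * k2)
        k4 = rhs(u + h * k3)
        u = u + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return u[0]

def bisect(g, lo, hi, tol=1e-10):
    """Plain bisection on a bracket [lo, hi] with g(lo) * g(hi) < 0."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0.0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)

g = lambda t: psi(t) - 1.0
ts = np.linspace(-12.0, 0.0, 25)                 # coarse scan of the range in Fig. 13.1
vals = [g(t) for t in ts]
roots = [bisect(g, ts[i], ts[i + 1])
         for i in range(len(ts) - 1) if vals[i] * vals[i + 1] < 0.0]
```

Each entry of `roots` is an initial slope $t$ for which the computed solution satisfies the boundary condition at $x = 1$; the scan is expected to find one bracket for each of the two solutions.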

For the positive solution it is reasonable to suppose that the above
proof could be modified so that it requires only that $\partial f/\partial y$ is positive
for values of $y$ in the neighbourhood of the solution, and the error bound
would then hold, at least if $h$ and $\varepsilon$ were sufficiently small. The analysis
of the error of the other solution, which takes negative values, will be
much more difficult, as our proof relies heavily on the monotonicity of
solutions of the linearised equation.

Fig. 13.2. The two solutions of the nonlinear boundary value problem (13.29).


13.8 Notes 


The following books are standard texts on the subject of numerical ap- 
proximation of boundary value problems: 


» H.B. KELLER, Numerical Methods for Two-Point Boundary Value 
Problems, Reprint of the 1968 original published by Blaisdell, Dover, 
New York, 1992. 

» H.B. KELLER, Numerical Solution of Two-Point Boundary Value 
Problems, SIAM, Philadelphia, fourth printing, 1990. 


A more recent survey of the subject is found in 


» U.M. ASCHER, R.M.M. MATTHEIJ AND R.D. RUSSELL, Numerical
Solution of Boundary Value Problems for Ordinary Differential
Equations, Corrected reprint of the 1988 original, Classics in Applied
Mathematics, 13, SIAM, Philadelphia, 1995.


In practical implementations of the shooting method in mathematical
software (see, for example, Appendix A in the Ascher et al. book), the
interval $[a,b]$ is subdivided into smaller intervals on each of which the
shooting method is applied with appropriately chosen initial values. The
'initial' conditions on the subintervals are then simultaneously adjusted
in order to satisfy the boundary conditions and appropriate continuity
conditions at the points of the subdivision. From the practical viewpoint,
this extension of the basic shooting method considered in this chapter
is extremely important: the various difficulties which may arise in the
implementation of the basic method (such as, for example, growth of the
solution to the initial value problem over the interval $[a,b]$, leading to
loss of accuracy in the solution of the equation $\psi(t) = B$) are discussed,
for example, in Section 2.4 of the 1992 book by Keller.

Sturm–Liouville problems originated in a paper of Jacques Charles
François Sturm: Sur les équations différentielles linéaires du second
ordre, J. Math. Pures Appl. 1, 106–186, 1836, in Joseph Liouville's
newly founded journal. Sturm's paper was followed by a series of articles
by Sturm and Liouville in subsequent volumes of the journal. They
examined general linear second-order differential equations, the properties
of their eigenvalues, the behaviour of the eigenfunctions and the series
expansion of arbitrary functions in terms of these eigenfunctions. An
extensive survey of the theory and numerical analysis of Sturm–Liouville
problems can be found in


» JOHN D. PRYCE, Numerical Solution of Sturm–Liouville Problems,
Oxford University Press Monographs in Numerical Analysis,
Clarendon Press, Oxford, 1993.


See also Section 11.3, page 478, of the Ascher et al. book cited above. 


Exercises 
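13.1 Suppose that $y \in C^6[x-h, x+h]$; show that there exists a real
number $\eta$ in $(x-h, x+h)$ such that

\[ \frac{\delta^2 y(x)}{h^2} = y''(x) + \tfrac{1}{12} h^2 y^{\mathrm{iv}}(x) + \tfrac{1}{360} h^4 y^{\mathrm{vi}}(\eta). \]

13.2 Use Theorem 3.6 to show that the matrix $M$ in (13.7) is
monotone. Use the result of Exercise 4 to show that
$\|M^{-1}\|_\infty \le \tfrac{1}{8}(b-a)^2$.

13.3 On the interval $[a,b]$ the differential equation

\[ -y'' + f(x)\,y = g(x) \]

is approximated by

\[ -\frac{\delta^2 y_j}{h^2} + \beta_{-1} f_{j-1} y_{j-1} + \beta_0 f_j y_j + \beta_1 f_{j+1} y_{j+1} = \beta_{-1} g_{j-1} + \beta_0 g_j + \beta_1 g_{j+1}, \]

where $\beta_{-1}$, $\beta_0$ and $\beta_1$ are constants. Assuming that the solution
$y$ has the appropriate number of continuous derivatives, show
that the truncation error of this approximation may be written
as follows:

(i) if $\beta_{-1} + \beta_0 + \beta_1 \ne 1$, then

\[ T_j = (\beta_{-1} + \beta_0 + \beta_1 - 1)\, y''(x_j) + Z_j^{(1)} h, \]

where $|Z_j^{(1)}| \le (|\beta_{-1}| + |\beta_1|)\, M_3$;

(ii) if $\beta_{-1} + \beta_0 + \beta_1 = 1$ and $\beta_1 \ne \beta_{-1}$, then

\[ T_j = (\beta_1 - \beta_{-1})\, h\, y'''(x_j) + Z_j^{(2)} h^2, \]

where $|Z_j^{(2)}| \le \bigl[ \tfrac{1}{2}(|\beta_{-1}| + |\beta_1|) + \tfrac{1}{12} \bigr] M_4$;

(iii) if $\beta_{-1} + \beta_0 + \beta_1 = 1$, $\beta_1 = \beta_{-1}$ and $\beta_1 \ne \tfrac{1}{12}$, then

\[ T_j = (\beta_1 - \tfrac{1}{12})\, h^2 y^{\mathrm{iv}}(x_j) + Z_j^{(3)} h^4, \]

where $|Z_j^{(3)}| \le \bigl[ \tfrac{1}{12}|\beta_1| + \tfrac{1}{360} \bigr] M_6$;

(iv) if $\beta_{-1} = \beta_1 = \tfrac{1}{12}$ and $\beta_0 = \tfrac{5}{6}$, then

\[ T_j = \tfrac{1}{240}\, h^4 y^{\mathrm{vi}}(x_j) + Z_j^{(4)} h^6, \]

where $|Z_j^{(4)}| \le \tfrac{17}{60480}\, M_8$.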

13.4 The approximation of Exercise 3 is used, with the values $\beta_1 =
\beta_{-1} = 1/12$, $\beta_0 = 5/6$. Use Taylor's Theorem with integral
remainder (Appendix, Theorem A.5) to show that the truncation
error of this approximation may be written

\[ T_j = -\frac{1}{h^2} \int_{-h}^{h} G(s)\, y^{\mathrm{vi}}(x_j + s)\, ds, \]

where

\[ G(s) = (h-s)^5/5! - \tfrac{1}{12} h^2 (h-s)^3/3!, \qquad 0 \le s \le h, \]

with a similar expression for $-h \le s \le 0$. Show that $G(s) \le 0$
for all $s \in [-h,h]$, and hence use the Integral Mean Value
Theorem to show that the truncation error can be expressed as

\[ T_j = \tfrac{1}{240}\, h^4 y^{\mathrm{vi}}(\xi) \]

for some value of $\xi$ in $(x_j - h, x_j + h)$.


13.5 Suppose that the solution of (13.1), (13.2) has a continuous sixth
derivative on $[a,b]$, and that $Y_j$ is the solution of the
approximation used in Exercise 4. Show that

\[ |y(x_j) - Y_j| \le \tfrac{1}{1920}\, h^4 (b-a)^2 M_6, \qquad j = 0, \ldots, n, \]

provided that $h^2 r(x_j) \le 12$ at the mesh points $x_j$.

13.6 Complete the proof of Theorem 13.7.

13.7 Show that the solution of the boundary value problem

\[ -y'' + a^2 y = 0, \qquad y(-1) = 1, \quad y(1) = 1, \]

is

\[ y(x) = \frac{\cosh ax}{\cosh a}. \]

Use the identity

\[ \cosh(x+h) + \cosh(x-h) = 2 \cosh x \cosh h \]

to verify that the solution of the difference approximation (13.5)
to this problem is

\[ Y_j = \frac{\cosh \theta x_j}{\cosh \theta}, \]

where

\[ \theta = (1/h) \cosh^{-1}(1 + \tfrac{1}{2} a^2 h^2). \]

By expanding in Taylor series, show that

\[ Y_j = y(x_j) + \tfrac{1}{24} h^2 a^3 (\cosh ax_j \sinh a - x_j \sinh ax_j \cosh a)/(\cosh a)^2 + O(h^4). \]

Verify that this result is consistent with Theorem 13.4 when $h$
is small.

13.8 Carry out a similar analysis as in Exercise 7 for the boundary
value problem

\[ -y'' - a^2 y = 0, \qquad y(0) = 0, \quad y(1) = 1, \]

and explain why in this case Theorem 13.4 cannot be used.
What restriction is required on the value of $a$?

13.9 The eigenvalue problem

\[ -y'' = \lambda y, \qquad y(0) = y(1) = 0, \]

is approximated by

\[ -\frac{Y_{j-1} - 2Y_j + Y_{j+1}}{h^2} = \mu Y_j, \quad 1 \le j \le n-1, \qquad Y_0 = Y_n = 0. \]

Show that the differential equation has solution $y = \sin m\pi x$,
$\lambda = m^2\pi^2$ for any positive integer $m$. Show also that the
difference approximation has solution $Y_j = \sin m\pi x_j$, $j = 0, 1, \ldots, n$,
and give an expression for the corresponding value of $\mu$. Use
the fact that

\[ 1 - \cos\theta = \tfrac{1}{2}\theta^2 - \tfrac{1}{24}\kappa\,\theta^4, \qquad |\kappa| \le 1, \]

to show that $|\lambda - \mu| \le m^4\pi^4 h^2/12$, and compare with the bound
given by (13.23).


14 


The finite element method 


14.1 Introduction: the model problem 


In Chapter 13 we explored finite difference methods for the numerical 
solution of two-point boundary value problems. The present chapter is 
devoted to the foundations of the theory of finite element methods. For 
the sake of simplicity the exposition will be, at least initially, confined 
to the second-order ordinary differential equation 


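\[ -\frac{d}{dx}\left( p(x)\,\frac{du}{dx} \right) + r(x)\,u = f(x), \qquad a < x < b, \tag{14.1} \]

where $p \in C^1[a,b]$, $r \in C[a,b]$, $f \in L^2(a,b)$ and $p(x) \ge c_0 > 0$, $r(x) \ge 0$
for all $x \in [a,b]$, subject to the boundary conditions

\[ u(a) = A, \qquad u(b) = B. \tag{14.2} \]

Later on in the chapter, in Section 14.5, we shall also consider the
ordinary differential equation

\[ -\frac{d}{dx}\left( p(x)\,\frac{du}{dx} \right) + q(x)\,\frac{du}{dx} + r(x)\,u = f(x), \qquad a < x < b, \tag{14.3} \]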


subject to the boundary conditions (14.2). Indeed, much of the
material discussed here can be extended to partial differential equations; for
pointers to the relevant literature we refer to the Notes at the end of the
chapter.

The finite element method was proposed in a paper by Richard Courant
in the early 1940s,¹ although the historical roots of the method can be
traced back to earlier work by Galerkin² in 1915; unfortunately, the
relevance of Courant's article was not recognised at the time and the idea
was forgotten. In the early 1950s the method was rediscovered by
engineers, but its systematic mathematical analysis began only a decade
later. Since then, the finite element method has been developed into one
of the most general and powerful techniques for the numerical solution
of differential equations which is widely used in engineering design and
analysis.

Unlike finite difference schemes which seek to approximate the un- 
known analytical solution to a differential equation at a finite number 
of selected points, the grid points or mesh points in the computational 
domain, the finite element method supplies an approximation to the 
analytical solution in the form of a piecewise polynomial function, de- 
fined over the entire computational domain. For example, in the case of 
the boundary value problem (14.1), (14.2), the simplest finite element 
method uses a linear spline, defined over the interval $[a,b]$, to
approximate the analytical solution $u$.

We shall consider two techniques for the construction of finite
element approximations: the Rayleigh–Ritz principle and the Galerkin
principle. In the case of the boundary value problem (14.1), (14.2) the
approximations which stem from these two principles will be seen to
coincide. We note, however, that since the Rayleigh–Ritz principle relies
on the fact that the boundary value problem under consideration can
be restated as a variational problem involving the minimisation of a
certain quadratic functional over a function space, its use is restricted to
symmetric boundary value problems, such as (14.1), (14.2) where (14.1)
does not contain a first-derivative term; for example, the Rayleigh–Ritz
principle is not applicable to (14.3), (14.2) unless $q(x) \equiv 0$. The precise
sense in which the word symmetric is to be interpreted here will be
clarified later in the chapter. On the other hand, as we shall see in Section
14.5, the Galerkin principle is more generally applicable and does not
require symmetry of the boundary value problem.

¹ R. Courant, Variational methods for the solution of problems of equilibrium and
vibrations, Bull. Amer. Math. Soc. 49, 1–23, 1943; Richard Courant (8 January
1888, Lublinitz, Prussia, Germany (now Lubliniec, Poland) – 27 January 1972,
New Rochelle, New York, USA). For an illuminating account of the lives of Richard
Courant and David Hilbert, see the book of Constance Reid: Hilbert–Courant,
Springer, New York, 1986.

² Boris Grigorievich Galerkin (4 March 1871, Polotsk, Russia (now in Belarus) – 12
July 1945, Moscow, USSR) studied mathematics and engineering at the St Petersburg
Technological Institute. During his studies he supported himself by private
tutoring and working as a designer. His ideas on the approximate solution of
differential equations were published in 1915. From 1940 until his death, Galerkin
was head of the Institute of Mechanics of the Soviet Academy of Sciences.

To make these observations rigorous, we recall from Chapter 11 the 
concept of Sobolev space. 


Definition 14.1 For a positive integer $k$, we define the Sobolev space
$H^k(a,b)$ as the set of real-valued functions $v$ defined on $[a,b]$ such that $v$
and all of its derivatives of order up to and including $k-1$ are absolutely
continuous on $[a,b]$ and

\[ v^{(k)} = \frac{d^k v}{dx^k} \in L^2(a,b). \]

Here $L^2(a,b)$ denotes the set of all functions defined on $(a,b)$ such that

\[ \|v\|_2 = \|v\|_{L^2(a,b)} = \left( \int_a^b |v(x)|^2\, dx \right)^{1/2} \]

is finite. We equip $H^k(a,b)$ with the Sobolev norm

\[ \|v\|_{H^k(a,b)} = \left( \sum_{m=0}^{k} \|v^{(m)}\|^2_{L^2(a,b)} \right)^{1/2}, \]

where $v^{(0)} = v$.


The Sobolev spaces $H^1(a,b)$ and $H^2(a,b)$ corresponding, respectively,
to $k = 1$ and $k = 2$ will be particularly relevant in this chapter. The
next definition introduces variants of the space $H^1(a,b)$ required for the
imposition of the boundary conditions (14.2).


Definition 14.2 (i) Given that $A$ and $B$ are real numbers, $H^1_E(a,b)$
will denote the set of all functions $v \in H^1(a,b)$ such that $v(a) = A$ and
$v(b) = B$.

(ii) $H^1_0(a,b)$ will signify the set of all functions $v \in H^1(a,b)$ such that
$v(a) = 0$ and $v(b) = 0$.


In the next section we shall state, using Sobolev spaces, the Rayleigh— 
Ritz and Galerkin principles associated with the boundary value problem 
(14.1), (14.2), and explore their relationship. 




14.2 Rayleigh—Ritz and Galerkin principles 


The Rayleigh–Ritz principle relies on converting the boundary value
problem (14.1), (14.2) into a variational problem involving the
minimisation of a certain quadratic functional over a function space.

Let us define the quadratic functional $\mathcal{J}: H^1_E(a,b) \to \mathbb{R}$ by

\[ \mathcal{J}(w) = \frac{1}{2} \int_a^b \bigl[ p(x)(w')^2 + r(x) w^2 \bigr]\, dx - \int_a^b f(x) w(x)\, dx, \]

where $w \in H^1_E(a,b)$, and consider the following variational problem:

(RR) find $u \in H^1_E(a,b)$ such that $\mathcal{J}(u) = \min_{w \in H^1_E(a,b)} \mathcal{J}(w)$,

which we shall henceforth refer to as the Rayleigh–Ritz principle.
For the sake of notational simplicity we define

\[ \mathcal{A}(w,v) = \int_a^b \bigl[ p(x) w'(x) v'(x) + r(x) w(x) v(x) \bigr]\, dx \]

and recall from Chapter 9 the definition of inner product on $L^2(a,b)$:

\[ \langle w, v \rangle = \int_a^b w(x) v(x)\, dx. \]

Using these, we can rewrite $\mathcal{J}(w)$ as follows:

\[ \mathcal{J}(w) = \tfrac{1}{2} \mathcal{A}(w,w) - \langle f, w \rangle, \qquad w \in H^1_E(a,b). \tag{14.4} \]

The mapping $\mathcal{A}: H^1(a,b) \times H^1(a,b) \to \mathbb{R}$ is a bilinear functional in
the following sense:

• $\mathcal{A}(\lambda_1 w_1 + \lambda_2 w_2, v) = \lambda_1 \mathcal{A}(w_1, v) + \lambda_2 \mathcal{A}(w_2, v)$
  for all $\lambda_1, \lambda_2 \in \mathbb{R}$ and all $w_1, w_2, v \in H^1(a,b)$;
• $\mathcal{A}(w, \mu_1 v_1 + \mu_2 v_2) = \mu_1 \mathcal{A}(w, v_1) + \mu_2 \mathcal{A}(w, v_2)$
  for all $\mu_1, \mu_2 \in \mathbb{R}$ and all $w, v_1, v_2 \in H^1(a,b)$.

We note, in addition, that the bilinear functional $\mathcal{A}(\cdot,\cdot)$ is symmetric,
in that

\[ \mathcal{A}(w,v) = \mathcal{A}(v,w) \qquad \forall\, w, v \in H^1(a,b). \tag{14.5} \]


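Our next result provides an equivalent characterisation of the
Rayleigh–Ritz principle; it relies on the fact that the bilinear functional
$\mathcal{A}(\cdot,\cdot)$ is symmetric in the sense of (14.5).

Theorem 14.1 A function $u$ in $H^1_E(a,b)$ minimises $\mathcal{J}(\cdot)$ over $H^1_E(a,b)$
if, and only if,

\[ \text{(G)} \qquad \mathcal{A}(u,v) = \langle f, v \rangle \qquad \forall\, v \in H^1_0(a,b). \tag{14.6} \]

This identity will be referred to as the Galerkin principle.¹


Proof of theorem Suppose that $u \in H^1_E(a,b)$ minimises $\mathcal{J}(\cdot)$ over
$H^1_E(a,b)$; that is, $\mathcal{J}(u) \le \mathcal{J}(w)$ for all $w \in H^1_E(a,b)$. Noting that
$w = u + \lambda v$ belongs to $H^1_E(a,b)$ for all $\lambda \in \mathbb{R}$ and all $v \in H^1_0(a,b)$, we
deduce that

\[ \begin{aligned}
\mathcal{J}(u) \le \mathcal{J}(u + \lambda v) &= \tfrac{1}{2} \mathcal{A}(u + \lambda v, u + \lambda v) - \langle f, u + \lambda v \rangle \\
&= \mathcal{J}(u) + \lambda \bigl[ \mathcal{A}(u,v) - \langle f, v \rangle \bigr] + \tfrac{1}{2} \lambda^2 \mathcal{A}(v,v)
\end{aligned} \tag{14.7} \]

for all $v \in H^1_0(a,b)$ and all $\lambda \in \mathbb{R}$. Here, in the transition from the first
line to the second we made use of the fact that $\mathcal{A}(u,v) = \mathcal{A}(v,u)$ for all
$v$ in $H^1_0(a,b)$, which follows from (14.5). Now, (14.7) implies that

\[ -\tfrac{1}{2} \lambda^2 \mathcal{A}(v,v) \le \lambda \bigl[ \mathcal{A}(u,v) - \langle f, v \rangle \bigr] \]

for all $v \in H^1_0(a,b)$ and all $\lambda \in \mathbb{R}$. Let us suppose that $\lambda > 0$, divide
both sides of the last inequality by $\lambda$ and pass to the limit $\lambda \to 0$ to
deduce that

\[ 0 \le \mathcal{A}(u,v) - \langle f, v \rangle \qquad \forall\, v \in H^1_0(a,b). \tag{14.8} \]

On replacing $v$ by $-v$ in (14.8), we have that also

\[ 0 \ge \mathcal{A}(u,v) - \langle f, v \rangle \qquad \forall\, v \in H^1_0(a,b). \tag{14.9} \]

We conclude from (14.8) and (14.9) that

\[ \mathcal{A}(u,v) = \langle f, v \rangle \qquad \forall\, v \in H^1_0(a,b), \tag{14.10} \]

as required.

Conversely, if $u \in H^1_E(a,b)$ is such that $\mathcal{A}(u,v) = \langle f, v \rangle$ for all $v$ in
$H^1_0(a,b)$, then

\[ \mathcal{J}(u + \lambda v) = \mathcal{J}(u) + \lambda \bigl[ \mathcal{A}(u,v) - \langle f, v \rangle \bigr] + \tfrac{1}{2} \lambda^2 \mathcal{A}(v,v) \ge \mathcal{J}(u) \]

for all $v \in H^1_0(a,b)$ and all $\lambda \in \mathbb{R}$; therefore, $u$ minimises $\mathcal{J}(\cdot)$ over
$H^1_E(a,b)$.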
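
The expansion (14.7) can be checked on a concrete case. The sketch below assumes the illustrative data $p \equiv 1$, $r \equiv 0$, $f \equiv 1$ on $(0,1)$ with $A = B = 0$, for which $u(x) = x(1-x)/2$ solves the boundary value problem, and a test direction $v(x) = x^2(1-x)$; these choices and the helper names are not taken from the text. All integrals are evaluated exactly using NumPy's polynomial class.

```python
import numpy as np
from numpy.polynomial import Polynomial

def integral01(q):
    """Exact integral over [0, 1] of the polynomial q."""
    Q = q.integ()
    return Q(1.0) - Q(0.0)

def A(w, v):
    # Bilinear functional with p = 1 and r = 0 (illustrative assumption).
    return integral01(w.deriv() * v.deriv())

def J(w):
    # Quadratic functional (14.4) with f = 1.
    return 0.5 * A(w, w) - integral01(w)

u = Polynomial([0.0, 0.5, -0.5])        # u(x) = x(1 - x)/2, solves -u'' = 1
v = Polynomial([0.0, 0.0, 1.0, -1.0])   # v(x) = x^2 (1 - x), v(0) = v(1) = 0

galerkin_residual = A(u, v) - integral01(v)     # A(u, v) - <f, v>
lams = np.linspace(-2.0, 2.0, 9)
increments = np.array([J(u + float(lam) * v) - J(u) for lam in lams])
```

Since the Galerkin residual vanishes, each entry of `increments` should equal $\tfrac{1}{2}\lambda^2 \mathcal{A}(v,v) \ge 0$, which is the statement that $u$ minimises $\mathcal{J}$ along the line $u + \lambda v$.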


Thus we have shown that, as long as $\mathcal{A}(\cdot,\cdot)$ is a symmetric bilinear
functional, $u \in H^1_E(a,b)$ satisfies the Rayleigh–Ritz principle if, and only
if, it satisfies the Galerkin principle. Our next task is to explain the
relationship between (RR) and (G) on the one hand and (14.1), (14.2) on
the other. Since in the case of a symmetric bilinear functional $\mathcal{A}(\cdot,\cdot)$
the principles (RR) and (G) are equivalent, it is sufficient to clarify the
connection between (G), for example, and the boundary value problem
(14.1), (14.2).

¹ In the language of the calculus of variations, (G) is the Euler–Lagrange equation
for the minimisation problem (RR).

We begin with the following definition. 


Definition 14.3 If a function $u \in H^1_E(a,b)$ satisfies the Galerkin
principle (14.6), it is called a weak solution to the boundary value problem
(14.1), (14.2), and the Galerkin principle is referred to as the weak
formulation of the boundary value problem (14.1), (14.2).


Let us justify this terminology. Suppose that $u \in H^2(a,b) \cap H^1_E(a,b)$
is a solution to the boundary value problem (14.1), (14.2). Then,

\[ -\frac{d}{dx}\left( p(x)\,\frac{du}{dx} \right) + r(x)\,u = f(x), \tag{14.11} \]

for almost every $x \in (a,b)$ (see the discussion prior to Example 11.1 for
a definition of almost every). Multiplying this equality by an arbitrary
function $v \in H^1_0(a,b)$, and integrating over $(a,b)$, we conclude that

\[ -\int_a^b \frac{d}{dx}\left( p(x)\,\frac{du}{dx} \right) v\, dx + \int_a^b r(x)\, u v\, dx = \int_a^b f(x) v(x)\, dx. \]

On integration by parts in the first term on the left-hand side,

\[ -\int_a^b \frac{d}{dx}\left( p(x)\,\frac{du}{dx} \right) v\, dx = -\left[ p(x)\,\frac{du}{dx}\, v \right]_{x=a}^{b} + \int_a^b p(x)\,\frac{du}{dx}\,\frac{dv}{dx}\, dx. \]

Since, by hypothesis, $v(a) = 0$ and $v(b) = 0$, it follows that

\[ \int_a^b p(x)\,\frac{du}{dx}\,\frac{dv}{dx}\, dx + \int_a^b r(x)\, u v\, dx = \int_a^b f(x) v(x)\, dx \]

for all $v \in H^1_0(a,b)$. Thus, we have shown the following result.


Theorem 14.2 If $u \in H^2(a,b) \cap H^1_E(a,b)$ is a solution to the boundary
value problem (14.1), (14.2), then $u$ is a weak solution to this problem;
that is,

\[ \mathcal{A}(u,v) = \langle f, v \rangle \qquad \forall\, v \in H^1_0(a,b). \tag{14.12} \]

The converse implication, namely that any weak solution $u \in H^1_E(a,b)$
of (14.1), (14.2) belongs to $H^2(a,b) \cap H^1_E(a,b)$ and solves (14.1), (14.2)
in the usual (pointwise) sense, is not true in general, unless the weak
solution can be shown to be sufficiently smooth to belong to $H^2(a,b)$.
It is for this reason that any function $u \in H^1_E(a,b)$ satisfying (14.12) is
called a weak solution of the original boundary value problem.

Thus, Theorem 14.1 shows that $u \in H^1_E(a,b)$ is a weak solution to
(14.1), (14.2) if, and only if, it minimises $\mathcal{J}(\cdot)$ over $H^1_E(a,b)$. Next, we
show that if a weak solution exists then it must be unique.


Theorem 14.3 The boundary value problem (14.1), (14.2) possesses at
most one weak solution in $H^1_E(a,b)$.


Proof The proof is by contradiction. Suppose that $u \in H^1_E(a,b)$ and
$\tilde{u} \in H^1_E(a,b)$ are two weak solutions to (14.1), (14.2). Then, $u - \tilde{u}$
belongs to $H^1_0(a,b)$, and

\[ \mathcal{A}(u - \tilde{u}, v) = \mathcal{A}(u,v) - \mathcal{A}(\tilde{u}, v) = \langle f, v \rangle - \langle f, v \rangle = 0 \]

for all $v \in H^1_0(a,b)$. In particular,

\[ \mathcal{A}(u - \tilde{u}, u - \tilde{u}) = 0. \]

However, since $p(x) \ge c_0 > 0$ and $r(x) \ge 0$ for all $x$ in $[a,b]$,

\[ \mathcal{A}(v,v) = \int_a^b \bigl[ p(x)(v')^2 + r(x) v^2 \bigr]\, dx \ge c_0 \int_a^b |v'|^2\, dx. \]

On choosing $v = u - \tilde{u}$, this implies that

\[ 0 = \mathcal{A}(u - \tilde{u}, u - \tilde{u}) \ge c_0 \int_a^b |(u - \tilde{u})'|^2\, dx. \]

Since the right-hand side in the last inequality is nonnegative, it follows
that $(u - \tilde{u})'(x) = 0$ for almost every $x$ in $(a,b)$; as $u - \tilde{u}$ is absolutely
continuous on $[a,b]$ and $(u - \tilde{u})(a) = (u - \tilde{u})(b) = 0$, we conclude that
$u = \tilde{u}$, and hence we get the desired uniqueness of a weak solution.


It turns out that under the present hypotheses on $p$, $r$ and $f$ the
existence of a weak solution $u \in H^1_E(a,b)$ is also ensured, although the
proof of this is less simple and is omitted here; the interested reader is
referred to the literature listed in the Notes at the end of the chapter.


14.3 Formulation of the finite element method 


In the previous section we showed that the weak solution to the boundary
value problem (14.1), (14.2) minimises $\mathcal{J}(\cdot)$ over $H^1_E(a,b)$. The finite
element method is based on constructing an approximate solution $u^h$ to
the problem by minimising $\mathcal{J}(\cdot)$ over a finite-dimensional subset $S^h_E$ of
$H^1_E(a,b)$, instead.

A simple way of constructing $S^h_E$ is to choose any function $\psi \in H^1_E(a,b)$,
for example,

\[ \psi(x) = \frac{B - A}{b - a}\,(x - a) + A, \tag{14.13} \]

and a finite set of linearly independent functions $\varphi_j$, $j = 1, \ldots, n-1$, in
$H^1_0(a,b)$ for $n \ge 2$, and then define

\[ S^h_E = \Bigl\{ v^h \in H^1_E(a,b) : \; v^h(x) = \psi(x) + \sum_{j=1}^{n-1} v_j \varphi_j(x), \ \text{where } (v_1, \ldots, v_{n-1})^{\mathrm T} \in \mathbb{R}^{n-1} \Bigr\}. \]

We consider the following approximation of problem (RR):

(RR)$^h$ find $u^h \in S^h_E$ such that $\mathcal{J}(u^h) = \min_{w^h \in S^h_E} \mathcal{J}(w^h)$.
Our next result is a finite-dimensional analogue of Theorem 14.1. 


Theorem 14.4 A function $u^h \in S^h_E$ minimises $\mathcal{J}(\cdot)$ over $S^h_E$ if, and
only if,

\[ \text{(G)}^h \qquad \mathcal{A}(u^h, v^h) = \langle f, v^h \rangle \qquad \forall\, v^h \in S^h_0. \tag{14.14} \]

Here,

\[ S^h_0 = \Bigl\{ v^h \in H^1_0(a,b) : \; v^h(x) = \sum_{j=1}^{n-1} v_j \varphi_j(x), \ \text{where } (v_1, \ldots, v_{n-1})^{\mathrm T} \in \mathbb{R}^{n-1} \Bigr\}. \]


The problem (G)$^h$ can be thought of as an approximation to the
Galerkin principle (G), and is therefore referred to as the Galerkin
method. For a similar reason, (RR)$^h$ is called the Rayleigh–Ritz method,
or just Ritz method. Thus, in complete analogy with the equivalence
of (RR) and (G) formulated in Theorem 14.1, Theorem 14.4 now
expresses the equivalence of (RR)$^h$ and (G)$^h$, the approximations to (RR)
and (G), respectively. Of course, as in the case of (RR) and (G), the
equivalence of (RR)$^h$ and (G)$^h$ relies on the assumption that the
bilinear functional $\mathcal{A}(\cdot,\cdot)$ is symmetric. The proof is identical to that of
Theorem 14.1, and is left as an exercise.

Theorem 14.4 provides no information about the existence and
uniqueness of $u^h$ that minimises $\mathcal{J}(\cdot)$ over $S^h_E$ (or, equivalently, of the existence
and uniqueness of $u^h$ that satisfies (14.14)). This question is settled by
our next result.


Theorem 14.5 There exists a unique function $u^h \in S^h_E$ that minimises
$\mathcal{J}(\cdot)$ over $S^h_E$; this $u^h$ is called the Ritz approximation to $u$.
Equivalently, there exists a unique function $u^h \in S^h_E$ that satisfies (14.14); this
$u^h$ is called the Galerkin approximation to $u$. The Ritz and Galerkin
approximations to $u$ coincide.


Proof We shall prove the second of these two equivalent statements:
we shall show that there exists a unique $u^h \in S^h_E$ that satisfies (14.14).
The proof of uniqueness of $u^h \in S^h_E$ is analogous to the proof of
Theorem 14.3, with $u$, $\tilde{u}$, $H^1_E(a,b)$ and $H^1_0(a,b)$ replaced by $u^h$, $\tilde{u}^h$, $S^h_E$ and
$S^h_0$, respectively. Since $S^h_E$ is finite-dimensional, the uniqueness of $u^h$
satisfying (14.14) implies its existence.


Having shown the existence and uniqueness of $u^h$ minimising $\mathcal{J}(\cdot)$
over $S^h_E$ (or, equivalently, satisfying (14.14)), we adopt the following
definition.


Definition 14.4 The functions $\varphi_i$, $i = 1, 2, \ldots, n-1$, appearing in the
definitions of $S^h_E$ and $S^h_0$ are called the Galerkin basis functions.


Since any function $v^h \in S^h_0$ can be represented as a linear combination
of the Galerkin basis functions $\varphi_i$, $1 \le i \le n-1$, it is clear that (14.14)
is equivalent to

\[ \mathcal{A}(u^h, \varphi_i) = \langle f, \varphi_i \rangle, \qquad 1 \le i \le n-1. \tag{14.15} \]

As $u^h$ belongs to $S^h_E$, it can be expressed in terms of $\psi$ and the Galerkin
basis functions as

\[ u^h(x) = \psi(x) + \sum_{j=1}^{n-1} u_j \varphi_j(x), \]

where $u_j \in \mathbb{R}$, $j = 1, \ldots, n-1$, are to be determined. On substituting
this expansion of $u^h$ into (14.15), we arrive at the following system of
simultaneous linear equations:

\[ \sum_{j=1}^{n-1} M_{ij} u_j = b_i, \qquad 1 \le i \le n-1, \tag{14.16} \]

where

\[ M_{ij} = \mathcal{A}(\varphi_j, \varphi_i), \qquad b_i = \langle f, \varphi_i \rangle - \mathcal{A}(\psi, \varphi_i). \tag{14.17} \]

The coefficients $u_j$, $1 \le j \le n-1$, in the representation of the
approximate solution are thus obtained by solving the system of linear equations
(14.16). The matrix $M$ is, clearly, symmetric (since the bilinear form
$\mathcal{A}(\cdot,\cdot)$ is symmetric by hypothesis) and positive definite, because

\[ \mathbf{v}^{\mathrm T} M \mathbf{v} = \mathcal{A}(v,v) > 0, \]

where $\mathbf{v} = (v_1, v_2, \ldots, v_{n-1})^{\mathrm T} \in \mathbb{R}^{n-1}$ is any nonzero vector and
$v = v_1 \varphi_1 + \cdots + v_{n-1} \varphi_{n-1} \in S^h_0$.
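
The system (14.16), (14.17) can be assembled for any admissible basis. As a small sketch, assume the illustrative data $p \equiv 1$, $r \equiv 0$, $f \equiv 1$ on $(0,1)$ with $A = B = 0$ (so that $\psi \equiv 0$), and the hypothetical polynomial basis $\varphi_j(x) = x^j(1-x)$, $j = 1, 2, 3$; none of these choices comes from the text. The exact solution $u(x) = x(1-x)/2$ then lies in the span of the basis, so the Galerkin coefficients should reproduce it.

```python
import numpy as np
from numpy.polynomial import Polynomial

def integral01(q):
    """Exact integral over [0, 1] of the polynomial q."""
    Q = q.integ()
    return Q(1.0) - Q(0.0)

x = Polynomial([0.0, 1.0])
phis = [x**j * (1 - x) for j in (1, 2, 3)]   # basis functions vanishing at 0 and 1
f = Polynomial([1.0])

# M_ij = A(phi_j, phi_i) with p = 1, r = 0;  b_i = <f, phi_i> since psi = 0.
M = np.array([[integral01(phi_j.deriv() * phi_i.deriv()) for phi_j in phis]
              for phi_i in phis])
b = np.array([integral01(f * phi_i) for phi_i in phis])

coeffs = np.linalg.solve(M, b)               # Galerkin coefficients u_1, u_2, u_3
```

The eigenvalues of `M` can be inspected with `np.linalg.eigvalsh(M)` to confirm numerically that the matrix is symmetric positive definite, as argued above.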

The Ritz and Galerkin methods can be used to compute an approximation
$u^h$ to $u$ as a linear combination of any finite set of linearly
independent functions $\varphi_i$, $1 \le i \le n-1$, in $H^1_0(a,b)$. We obtain the Ritz
finite element method and the Galerkin finite element method,
respectively, when we select the approximating subspaces $S^h_E$ and $S^h_0$ in
the Ritz or the Galerkin method to be spaces of spline functions (see
Chapter 11). Here we only consider the simplest case of linear splines,
and choose the basis functions $\varphi_i$, $1 \le i \le n-1$, to be the hat functions
(11.4). We begin by fixing a set of points $x_k$, $k = 0, 1, \dots, n$, $n \ge 2$, in
the interval $[a,b]$ such that

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b. \tag{14.18}$$


The intervals $[x_{i-1}, x_i]$, $1 \le i \le n$, are referred to as elements; hence
the name finite element method. In the theory of the finite element
method (14.18) is called a subdivision of the computational domain
$[a,b]$, and the points $x_k$ are called mesh points. The function $\varphi_i$ is the
piecewise linear function which takes the value 0 at all the mesh points
except $x_i$, where it takes the value 1. Thus,

$$\varphi_i(x) = \begin{cases} (x - x_{i-1})/h_i & \text{if } x_{i-1} \le x \le x_i, \\ (x_{i+1} - x)/h_{i+1} & \text{if } x_i \le x \le x_{i+1}, \\ 0 & \text{otherwise}, \end{cases} \tag{14.19}$$

where $h_i = x_i - x_{i-1}$. The functions $\varphi_i$, $1 \le i \le n-1$, are called the
(piecewise linear) finite element basis functions and the associated
Galerkin approximation $u^h$ is referred to as the (piecewise linear) finite
element approximation of $u$. The closure of the interval $(x_{i-1}, x_{i+1})$
over which $\varphi_i$ is nonzero is called the support of the function $\varphi_i$. The
piecewise linear finite element basis function $\varphi_i$, $1 \le i \le n-1$, with
support $[x_{i-1}, x_{i+1}]$, is depicted in Figure 14.1.
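The definition (14.19) translates directly into code. The following is a minimal sketch, assuming NumPy is available; the function name `hat` and the sample mesh are illustrative choices and do not come from the text. Taking $i = 0$ or $i = n$ gives the one-sided functions restricted to $[x_0, x_n]$.

```python
import numpy as np

def hat(i, x, nodes):
    """Evaluate the piecewise linear basis function phi_i of (14.19) at x.

    nodes holds the mesh points x_0 < x_1 < ... < x_n (hypothetical input);
    for i = 0 or i = n only the part inside [x_0, x_n] is kept.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    nodes = np.asarray(nodes, dtype=float)
    phi = np.zeros_like(x)
    if i > 0:  # rising part on [x_{i-1}, x_i]
        left = (x >= nodes[i - 1]) & (x <= nodes[i])
        phi[left] = (x[left] - nodes[i - 1]) / (nodes[i] - nodes[i - 1])
    if i < len(nodes) - 1:  # falling part on [x_i, x_{i+1}]
        right = (x >= nodes[i]) & (x <= nodes[i + 1])
        phi[right] = (nodes[i + 1] - x[right]) / (nodes[i + 1] - nodes[i])
    return phi

nodes = np.array([0.0, 0.2, 0.5, 0.6, 1.0])     # a nonuniform subdivision
x = np.linspace(0.0, 1.0, 11)
# phi_i(x_j) should be 1 if i == j and 0 otherwise
values_at_nodes = np.array([hat(i, nodes, nodes) for i in range(len(nodes))])
# phi_0 + ... + phi_n should be identically 1 on [a, b]
total = sum(hat(i, x, nodes) for i in range(len(nodes)))
```

The two checks at the end reflect the interpolation property of the hat functions and the fact that, together with $\varphi_0$ and $\varphi_n$, they sum to one on $[a,b]$.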
Fig. 14.1. A piecewise linear finite element basis function, $\varphi_i$, $1 \le i \le n-1$.


For the finite element method the important property of the basis
functions $\varphi_i$, $1 \le i \le n-1$, is that they have local support, being
nonzero only in one pair of adjacent intervals, $(x_{i-1}, x_i]$ and $[x_i, x_{i+1})$.
This means that, in the matrix $M$,

$$M_{ij} = 0 \quad \text{if } |i - j| > 1.$$

The matrix $M$ is, therefore, symmetric, positive definite and tridiagonal,
and the associated system of linear equations can be solved very
efficiently by the methods of Section 3.3, the most efficient algorithm
being LU decomposition, without any use of symmetry. The fact that
$M$ is positive definite means that no interchanges are necessary.
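As an illustration of how cheaply such a system can be solved, here is a minimal sketch of LU decomposition without interchanges specialised to a tridiagonal matrix (often called the Thomas algorithm). It is not reproduced from Section 3.3; the function name and the storage convention for the three diagonals are assumptions made for this example.

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination without interchanges.

    Row i reads lower[i]*u[i-1] + diag[i]*u[i] + upper[i]*u[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored (hypothetical storage convention).
    """
    n = len(diag)
    d = np.array(diag, dtype=float)   # working copies, inputs are not modified
    b = np.array(rhs, dtype=float)
    for i in range(1, n):             # forward elimination (the LU factorisation)
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    u = np.empty(n)
    u[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        u[i] = (b[i] - upper[i] * u[i + 1]) / d[i]
    return u

n = 5
lower = np.full(n, -1.0)
diag = np.full(n, 2.0)
upper = np.full(n, -1.0)
rhs = np.arange(1.0, n + 1)
u = solve_tridiagonal(lower, diag, upper, rhs)
# dense copy of the same matrix, used only to check the result
M = np.diag(diag) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
```

The work is proportional to $n$, and no pivoting logic is needed when the matrix is symmetric positive definite, as it is here.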

The function $\psi$ in (14.13), which is included in the definition of $S^h_E$ to
ensure that $u^h$ satisfies the boundary conditions at $x = a$ and $x = b$, is
then given by

$$\psi(x) = A\varphi_0(x) + B\varphi_n(x),$$

which is also piecewise linear; clearly, $\psi(a) = A$ and $\psi(b) = B$. Here, $\varphi_0$
and $\varphi_n$ are defined by setting, respectively, $i = 0$ and $i = n$ in (14.19)
and restricting the resulting functions to the interval $[a,b] = [x_0, x_n]$.
In (14.17) we see that the term $A(\psi, \varphi_i)$ is nonzero only for $i = 1$ and
$i = n-1$.

Before attempting to solve the system of linear equations we must, of
course, first compute the elements of the matrix $M$, and the quantities
on the right-hand side, $b_i$, $i = 1, \dots, n-1$; see (14.16) and (14.17). The
matrix elements are obtained from

$$M_{ij} = A(\varphi_j, \varphi_i) = \int_a^b p(x)\,\varphi_j'(x)\,\varphi_i'(x)\,dx + \int_a^b r(x)\,\varphi_j(x)\,\varphi_i(x)\,dx,$$

with $1 \le i, j \le n-1$. We have written this as the sum of two terms,
as the matrix $M$ is often written in this way as the sum of two matrices
which, for historical reasons, are often known as the stiffness matrix
and the mass matrix, respectively. The terms $M_{ij}$ are very simple; in
fact in the first integral the derivatives $\varphi_i'$ and $\varphi_j'$ are piecewise constant
functions over $[a,b]$.

It may be possible to compute these integrals analytically, but more
generally some form of numerical quadrature will be necessary. It is
then easy to show that if we use certain types of quadrature formulae
we shall be led to the same system of equations as in the finite difference
method of Section 13.5. Consider the particularly simple case where the
mesh points are equally spaced, so that $x_j = a + jh$, $j = 0, 1, \dots, n$,
$h = (b-a)/n$. If we then approximate the integrals involved in the
stiffness matrix by the midpoint rule (see Chapter 10), we obtain

$$\int_{x_{i-1}}^{x_i} p(x)\,\varphi_{i-1}'(x)\,\varphi_i'(x)\,dx = -\frac{1}{h^2}\int_{x_{i-1}}^{x_i} p(x)\,dx \approx -p_{i-1/2}/h,$$

where $p_{i-1/2} = p(x_i - h/2)$, and similarly for the other integrals involved.
For the integrals in the mass matrix we use the trapezium rule, and then

$$\int_{x_{i-1}}^{x_i} r(x)\,\varphi_{i-1}(x)\,\varphi_i(x)\,dx \approx 0,$$

since $\varphi_i$ is zero at $x_{i-1}$ and $\varphi_{i-1}$ is zero at $x_i$. In the same way

$$\int_{x_{i-1}}^{x_i} r(x)\,[\varphi_i(x)]^2\,dx \approx \tfrac{1}{2} h r_i,$$

where $r_i = r(x_i)$, since $\varphi_i$ is zero at one end of the interval and unity at
the other. The other part of the integral is, similarly,

$$\int_{x_i}^{x_{i+1}} r(x)\,[\varphi_i(x)]^2\,dx \approx \tfrac{1}{2} h r_i. \tag{14.20}$$

Assuming that $f \in C[a,b]$, approximating the integral on the right-hand
side by the trapezium rule in the same way, and putting all the parts
together, equation (14.14) now takes the approximate form

$$-\frac{p_{i-1/2}}{h}\,u_{i-1} + \frac{p_{i-1/2} + p_{i+1/2}}{h}\,u_i - \frac{p_{i+1/2}}{h}\,u_{i+1} + h r_i u_i = h f_i,$$

for $i = 1, 2, \dots, n-1$, with the notational convention that $u_0 = A$ and
$u_n = B$, and $f_i = f(x_i)$; clearly, this is the same as the finite difference
equation (13.19). Of course, had we used a different set of basis functions
$\varphi_i$, $1 \le i \le n-1$, or different numerical quadrature rules, the finite
element and finite difference methods would have no longer been identical.
Indeed, this example is just an illustration of the relation between
the two methods; we should normally expect to compute the entries of
the matrix $M$ by using some more accurate quadrature method, such as
a two-point Gauss formula.
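The quadrature-based system just described can be assembled in a few lines. The sketch below assumes NumPy and a uniform mesh; the function name `fe_system_quadrature` and the sample coefficients are illustrative and not taken from the text. The stiffness entries use the midpoint rule, the mass and load entries the trapezium rule, and the boundary values $u_0 = A$, $u_n = B$ are moved to the right-hand side.

```python
import numpy as np

def fe_system_quadrature(p, r, f, a, b, A, B, n):
    """Assemble M u = rhs for the interior unknowns u_1, ..., u_{n-1}.

    p, r, f are vectorised callables (hypothetical inputs); the rows are
    -(p_{i-1/2}/h) u_{i-1} + ((p_{i-1/2}+p_{i+1/2})/h + h r_i) u_i
    -(p_{i+1/2}/h) u_{i+1} = h f_i.
    """
    h = (b - a) / n
    x = a + h * np.arange(n + 1)
    p_mid = p(x[:-1] + h / 2)                 # p_mid[j] = p_{j+1/2}, j = 0..n-1
    main = (p_mid[:-1] + p_mid[1:]) / h + h * r(x)[1:-1]
    off = -p_mid[1:-1] / h                    # coupling between u_i and u_{i+1}
    M = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    rhs = h * f(x)[1:-1]
    rhs[0] += p_mid[0] / h * A                # known boundary value u_0 = A
    rhs[-1] += p_mid[-1] / h * B              # known boundary value u_n = B
    return x, M, rhs

# Illustrative data: p(x) = 1 + x, r(x) = 1, f(x) = 1, u(0) = 0, u(1) = 1
x, M, rhs = fe_system_quadrature(
    p=lambda t: 1.0 + t,
    r=lambda t: np.ones_like(t),
    f=lambda t: np.ones_like(t),
    a=0.0, b=1.0, A=0.0, B=1.0, n=8,
)
u_inner = np.linalg.solve(M, rhs)
```

A dense matrix is used only for brevity; in practice one would store the three diagonals and use a tridiagonal solver.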

In the next two sections we shall assess the accuracy of the finite
element method. Our goal is to quantify the amount of reduction in the
error $u - u^h$ as the mesh spacing $h$ is reduced.


14.4 Error analysis of the finite element method 


We begin with a fundamental result that underlies the error analysis of 
finite element methods. 


Theorem 14.6 (Céa’s Lemma) Suppose that $u$ is the function that
minimises $J(\cdot)$ over $H^1_E(a,b)$ (or, equivalently, that $u$ satisfies (14.6)),
and that $u^h$ is its Galerkin approximation obtained by minimising $J(\cdot)$
over $S^h_E$ (or, equivalently, that $u^h$ satisfies (14.14)). Then,

$$A(u - u^h, v^h) = 0 \quad \forall v^h \in S^h_0, \tag{14.21}$$

and

$$A(u - u^h, u - u^h) = \min_{v^h \in S^h_E} A(u - v^h, u - v^h). \tag{14.22}$$


The identity (14.21) is referred to as Galerkin orthogonality. The
terminology stems from the fact that, since the bilinear functional $A(\cdot, \cdot)$
is symmetric and $A(v,v) > 0$ for all $v \in H^1_0(a,b) \setminus \{0\}$, $A(\cdot, \cdot)$ is an
inner product in the linear space $H^1_0(a,b)$. Therefore, by virtue of
Definition 9.2, (14.21) means that $u - u^h$ is orthogonal to $S^h_0$ in $H^1_0(a,b)$.
A geometrical illustration of Galerkin orthogonality is given in Figure
14.2. Given that $\psi$ is a fixed element of $H^1_E(a,b)$, the mapping

$$R^h : u - \psi \in H^1_0(a,b) \mapsto u^h - \psi \in S^h_0$$

which assigns a $u^h \in S^h_E$ to $u \in H^1_E(a,b)$ (where $u$ and $u^h$ are as in
Theorem 14.6) is called the Ritz projector.
Fig. 14.2. Illustration of the Galerkin orthogonality property of the finite
element method. $A((u - \psi) - (u^h - \psi), v^h) = A(u - u^h, v^h) = 0$ for all $v^h$ in $S^h_0$.
Here, $\psi(x) = A\varphi_0(x) + B\varphi_n(x)$, so that $u - \psi \in H^1_0(a,b)$ and $u^h - \psi \in S^h_0$.
The 0 in the figure denotes the zero element of the linear space $S^h_0$ (and,
simultaneously, that of $H^1_0(a,b)$), namely the function that is identically zero on
the interval $(a,b)$.


Proof of theorem By the definition of the Galerkin method (G)$^h$,

$$A(u^h, v^h) = (f, v^h) \quad \forall v^h \in S^h_0.$$

On the other hand, we deduce from (G) that

$$A(u, v^h) = (f, v^h) \quad \forall v^h \in S^h_0,$$

since $v^h \in S^h_0 \subset H^1_0(a,b)$. The Galerkin orthogonality property (14.21)
follows by subtraction.
Now suppose that $v^h$ is any function in $S^h_E$; then,

$$\begin{aligned} A(u - v^h, u - v^h) &= A(u - u^h + u^h - v^h,\; u - u^h + u^h - v^h) \\ &= A(u - u^h, u - u^h) + A(u^h - v^h, u^h - v^h), \end{aligned}$$

by Galerkin orthogonality, given that $u^h - v^h \in S^h_0$. In the transition
from the first line to the second, we made use of the fact that the
bilinear functional $A$ is symmetric. As the term $A(u^h - v^h, u^h - v^h)$ is
nonnegative, we deduce that

$$A(u - u^h, u - u^h) \le A(u - v^h, u - v^h) \quad \forall v^h \in S^h_E,$$

with equality when $v^h = u^h$; hence (14.22).
Motivated by the minimisation property (14.22), we define the energy
norm $\|\cdot\|_A$ on $H^1_0(a,b)$ via

$$\|v\|_A = [A(v, v)]^{1/2}. \tag{14.23}$$

Under our hypotheses on $p$ and $r$, it is easy to see that $\|\cdot\|_A$ satisfies
all axioms of norm (see Chapter 2). The result we have just proved
shows that $u^h$ is the best approximation from $S^h_E$ to the true solution
$u \in H^1_E(a,b)$ of our problem, when we measure the error of the approximation
in the energy norm:

$$\|u - u^h\|_A = \min_{v^h \in S^h_E} \|u - v^h\|_A. \tag{14.24}$$
v E 


A particularly relevant question is how the error $u - u^h$ depends on
the spacing $h$ of the subdivision of the computational domain $[a,b]$. We
can obtain a bound on the error $u - u^h$, measured in the energy norm,
by choosing a particular function $v^h \in S^h_E$ in (14.24) whose closeness to
$u$ is easy to assess. For this purpose, we introduce the finite element
interpolant $\mathcal{I}^h u \in S^h_E$ of $u \in H^1_E(a,b)$ by

$$\mathcal{I}^h u(x) = \psi(x) + \sum_{i=1}^{n-1} u(x_i)\,\varphi_i(x), \qquad x \in [a,b].$$

Clearly,

$$\mathcal{I}^h u(x_j) = u(x_j), \qquad j = 0, 1, \dots, n,$$

which justifies our use of the word interpolant.
We then deduce from (14.24) that

$$\|u - u^h\|_A \le \|u - \mathcal{I}^h u\|_A; \tag{14.25}$$

hence, in order to quantify $\|u - u^h\|_A$, we only need to estimate the size
of $\|u - \mathcal{I}^h u\|_A$. This leads us to the next theorem.


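Theorem 14.7 Suppose that $u \in H^2(a,b) \cap H^1_E(a,b)$ and let $\mathcal{I}^h u$ be the
finite element interpolant of $u$ from $S^h_E$ defined above; then, the following
error bounds hold:

$$\|u - \mathcal{I}^h u\|_{L^2(x_{i-1}, x_i)} \le \left(\frac{h_i}{\pi}\right)^2 \|u''\|_{L^2(x_{i-1}, x_i)},$$

$$\|u' - (\mathcal{I}^h u)'\|_{L^2(x_{i-1}, x_i)} \le \frac{h_i}{\pi}\,\|u''\|_{L^2(x_{i-1}, x_i)},$$

for $i = 1, 2, \dots, n$, where $h_i = x_i - x_{i-1}$.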
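Proof Consider an element $[x_{i-1}, x_i]$, $1 \le i \le n$, and define
$\zeta(x) = u(x) - \mathcal{I}^h u(x)$ for $x \in [x_{i-1}, x_i]$. Then, $\zeta \in H^2(x_{i-1}, x_i)$ and
$\zeta(x_{i-1}) = \zeta(x_i) = 0$. Therefore $\zeta$ can be expanded into a convergent
Fourier sine series,

$$\zeta(x) = \sum_{k=1}^{\infty} a_k \sin\frac{k\pi(x - x_{i-1})}{h_i}, \qquad x \in [x_{i-1}, x_i].$$

Here, convergence is to be understood in the norm $\|\cdot\|_{L^2(x_{i-1}, x_i)}$. Hence,

$$\begin{aligned} \int_{x_{i-1}}^{x_i} [\zeta(x)]^2\,dx &= \sum_{k,\ell=1}^{\infty} a_k a_\ell \int_{x_{i-1}}^{x_i} \sin\frac{k\pi(x - x_{i-1})}{h_i}\,\sin\frac{\ell\pi(x - x_{i-1})}{h_i}\,dx \\ &= h_i \sum_{k,\ell=1}^{\infty} a_k a_\ell \int_0^1 \sin k\pi t\,\sin \ell\pi t\,dt \\ &= \frac{h_i}{2} \sum_{k,\ell=1}^{\infty} a_k a_\ell\,\delta_{k\ell} = \frac{h_i}{2} \sum_{k=1}^{\infty} |a_k|^2, \end{aligned}$$

where $\delta_{k\ell}$ is the Kronecker delta. Differentiating the Fourier sine series
of $\zeta$ twice, we find that the Fourier coefficients of $\zeta'$ are $(k\pi/h_i)a_k$,
while those of $\zeta''$ are $-(k\pi/h_i)^2 a_k$. Thus, proceeding in the same way
as above,

$$\int_{x_{i-1}}^{x_i} [\zeta'(x)]^2\,dx = \frac{h_i}{2} \sum_{k=1}^{\infty} \left(\frac{k\pi}{h_i}\right)^2 |a_k|^2,$$

$$\int_{x_{i-1}}^{x_i} [\zeta''(x)]^2\,dx = \frac{h_i}{2} \sum_{k=1}^{\infty} \left(\frac{k\pi}{h_i}\right)^4 |a_k|^2.$$

Because $k^4 \ge k^2 \ge 1$, it follows that

$$\int_{x_{i-1}}^{x_i} [\zeta(x)]^2\,dx \le \left(\frac{h_i}{\pi}\right)^4 \int_{x_{i-1}}^{x_i} [\zeta''(x)]^2\,dx,$$

$$\int_{x_{i-1}}^{x_i} [\zeta'(x)]^2\,dx \le \left(\frac{h_i}{\pi}\right)^2 \int_{x_{i-1}}^{x_i} [\zeta''(x)]^2\,dx.$$

However, $\zeta''(x) = u''(x) - (\mathcal{I}^h u)''(x) = u''(x)$ for $x \in (x_{i-1}, x_i)$, and
hence the desired bounds on the interpolation error.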
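The two bounds of Theorem 14.7 are easy to check numerically for a smooth function. The following sketch assumes NumPy and uses Gauss-Legendre quadrature on each element; the test function $u(x) = \sin(\pi x)$ with homogeneous boundary values, the mesh and the helper name `interpolation_errors` are choices made for this illustration only.

```python
import numpy as np

def interpolation_errors(u, du, d2u, nodes, quad_order=8):
    """Per-element L2 norms of u - I^h u, u' - (I^h u)' and u''."""
    t, w = np.polynomial.legendre.leggauss(quad_order)   # rule on [-1, 1]
    e0, e1, e2 = [], [], []
    for xl, xr in zip(nodes[:-1], nodes[1:]):
        h = xr - xl
        x = 0.5 * (xl + xr) + 0.5 * h * t      # quadrature points in the element
        wq = 0.5 * h * w
        slope = (u(xr) - u(xl)) / h            # derivative of the linear interpolant
        interp = u(xl) + slope * (x - xl)
        e0.append(np.sqrt(np.sum(wq * (u(x) - interp) ** 2)))
        e1.append(np.sqrt(np.sum(wq * (du(x) - slope) ** 2)))
        e2.append(np.sqrt(np.sum(wq * d2u(x) ** 2)))
    return np.array(e0), np.array(e1), np.array(e2)

nodes = np.linspace(0.0, 1.0, 9)               # 8 elements of equal length
h = np.diff(nodes)
e0, e1, e2 = interpolation_errors(
    lambda x: np.sin(np.pi * x),
    lambda x: np.pi * np.cos(np.pi * x),
    lambda x: -np.pi ** 2 * np.sin(np.pi * x),
    nodes,
)
bound0 = (h / np.pi) ** 2 * e2                 # bound on the L2 interpolation error
bound1 = (h / np.pi) * e2                      # bound on the error in the derivative
```

On every element `e0` should not exceed `bound0` and `e1` should not exceed `bound1`, in agreement with the theorem.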
Now, substituting the bounds from Theorem 14.7 into the definition
of the norm $\|u - \mathcal{I}^h u\|_A$, we arrive at the following estimate of the
interpolation error in the energy norm.


Corollary 14.1 Suppose that $u \in H^2(a,b) \cap H^1_E(a,b)$. Then,

$$\|u - \mathcal{I}^h u\|_A^2 \le \sum_{i=1}^{n} \left[ P_i \left(\frac{h_i}{\pi}\right)^2 + R_i \left(\frac{h_i}{\pi}\right)^4 \right] \|u''\|^2_{L^2(x_{i-1}, x_i)},$$

where $P_i = \max_{x \in [x_{i-1}, x_i]} p(x)$ and $R_i = \max_{x \in [x_{i-1}, x_i]} r(x)$.


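Proof Let us observe that

$$\begin{aligned} \|v\|_A^2 = A(v, v) &= \int_a^b \left[ p(x)\,|v'(x)|^2 + r(x)\,|v(x)|^2 \right] dx \\ &= \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[ p(x)\,|v'(x)|^2 + r(x)\,|v(x)|^2 \right] dx \\ &\le \sum_{i=1}^{n} \left\{ P_i\,\|v'\|^2_{L^2(x_{i-1}, x_i)} + R_i\,\|v\|^2_{L^2(x_{i-1}, x_i)} \right\}. \end{aligned}$$

On letting $v = u - \mathcal{I}^h u$ and applying the preceding theorem on the
right-hand side of the last inequality, with $v'$ and $v$ replaced by $u' - (\mathcal{I}^h u)'$
and $u - \mathcal{I}^h u$, respectively, the result follows.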


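Inserting this estimate into (14.25) leads to the desired bound on the
error between the analytical solution $u$ and its finite element
approximation $u^h$ in the energy norm.


Corollary 14.2 Suppose that $u \in H^2(a,b) \cap H^1_E(a,b)$. Then,

$$\|u - u^h\|_A^2 \le \sum_{i=1}^{n} \left[ P_i \left(\frac{h_i}{\pi}\right)^2 + R_i \left(\frac{h_i}{\pi}\right)^4 \right] \|u''\|^2_{L^2(x_{i-1}, x_i)},$$

where $P_i = \max_{x \in [x_{i-1}, x_i]} p(x)$ and $R_i = \max_{x \in [x_{i-1}, x_i]} r(x)$. Further,

$$\|u - u^h\|_A \le \frac{h}{\pi} \left( P + \frac{h^2}{\pi^2} R \right)^{1/2} \|u''\|_{L^2(a,b)}, \tag{14.26}$$

where $P = \max_{x \in [a,b]} p(x)$, $R = \max_{x \in [a,b]} r(x)$, and $h = \max_{1 \le i \le n} h_i$.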
Fig. 14.3. Graph of the finite element approximation $u^h$ to the analytical
solution $u$ of the boundary value problem (14.27) on a uniform subdivision
of $[0,1]$ of spacing $h = 1/n$, with $n = 2$ (top left), $n = 4$ (top right), $n = 6$
(bottom left), and $n = 100$ (bottom right). In each of the four subfigures, the
dashed curve is the graph of the analytical solution $u(x) = \sin(\pi x)$. In the last
figure the approximation error is so small that $u$ and $u^h$ are indistinguishable.


In order to illustrate the performance of the finite element method,
we consider the following example:

$$-u'' + r(x)u = f(x), \quad x \in (0,1), \qquad u(0) = 0, \quad u(1) = 0. \tag{14.27}$$

If $r(x) \equiv 1$ and $f(x) = (1 + \pi^2)\sin(\pi x)$, the unique solution to this
problem is $u(x) = \sin(\pi x)$. Let us pretend that we do not know the
analytical solution $u$, and solve the boundary value problem numerically,
using the finite element method on a subdivision of $[0,1]$ of uniform
spacing $h = 1/n$, for various values of $n$. The integrals $(f, \varphi_i)$ involved
in the definition of $b_i$ in (14.17) have been approximated, on each of the
elements $[x_{i-1}, x_i]$, $1 \le i \le n$, by means of the trapezium rule. The
resulting approximations $u^h$, for $n = 2, 4, 6, 100$, are shown in Figure
14.3.
We see from Figure 14.3 that, as the spacing $h$ of the subdivision
is reduced, the finite element solution $u^h$ approximates the analytical
solution $u(x) = \sin(\pi x)$ with increasing accuracy. Indeed, the results
corresponding to $n = 2$ and $n = 4$ in Figure 14.3 indicate that as the
number of intervals in the subdivision is doubled (i.e., $h$ is halved), the
maximum error between $u(x)$ and $u^h(x)$ is reduced by a factor of about 4.
This reduction in the error cannot be explained by Corollary 14.2 which
merely implies that halving $h$ should lead to a reduction in $\|u - u^h\|_A$
by a factor no less than 2. If you would like to learn more about the
source of the observed enhancement of accuracy, consult Exercise 5 at
the end of the chapter.
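The experiment is simple to repeat. The sketch below assumes NumPy; it integrates the stiffness and mass matrices exactly (an assumption, since the text only specifies the quadrature used for the load integrals), applies the trapezium rule to $(f, \varphi_i)$, and measures the maximum error on a fine sample grid. The function names are illustrative.

```python
import numpy as np

def fe_solve_model(n):
    """Piecewise linear FE solution of -u'' + u = (1 + pi^2) sin(pi x), u(0) = u(1) = 0."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    # exact stiffness (1/h)[-1, 2, -1] plus exact mass (h/6)[1, 4, 1]
    main = np.full(n - 1, 2.0 / h + 4.0 * h / 6.0)
    off = np.full(n - 2, -1.0 / h + h / 6.0)
    M = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    # trapezium rule on each element gives (f, phi_i) ~ h f(x_i)
    b = h * (1.0 + np.pi ** 2) * np.sin(np.pi * x[1:-1])
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(M, b)
    return x, u

def max_error(n, samples=2001):
    """Maximum of |u - u^h| over a fine uniform grid (which contains the mesh points)."""
    x, u = fe_solve_model(n)
    xs = np.linspace(0.0, 1.0, samples)
    return np.max(np.abs(np.sin(np.pi * xs) - np.interp(xs, x, u)))

errors = {n: max_error(n) for n in (2, 4, 8, 16)}
ratios = [errors[n] / errors[2 * n] for n in (2, 4, 8)]
```

The entries of `ratios` are close to 4, which is the second-order behaviour in the maximum norm noted above, whereas (14.26) only guarantees first-order convergence in the energy norm.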
14.5 A posteriori error analysis by duality


The bound on the error between the analytical solution $u$ and its finite
element approximation $u^h$ formulated in Corollary 14.2 shows that, in
the limit of $h \to 0$, the error $\|u - u^h\|_A$ will tend to zero as $O(h)$.
This is a useful result from the theoretical point of view: it reassures us
that the unknown analytical solution may be approximated arbitrarily
well by making $h$ sufficiently small. On the other hand, asymptotic
error bounds of this kind are not particularly helpful for the purpose of
precisely quantifying the size of the error between $u$ and $u^h$ for a given,
fixed, mesh size $h > 0$: as $u$ is unknown, it is difficult to tell just how
large the right-hand side of (14.26) really is.

The aim of the present section is, therefore, to derive a computable
bound on the error, and to demonstrate how such a bound may be
implemented into an adaptive mesh-refinement algorithm, capable of
reducing the error $u - u^h$ below a certain prescribed tolerance in an
automated manner, without human intervention. The approach is based
on seeking a bound on $u - u^h$ in terms of the computed solution $u^h$
rather than in terms of norms of the unknown analytical solution $u$. A
bound on the error in terms of $u^h$ is referred to as an a posteriori
error bound, due to the fact that it becomes computable only after the
numerical solution $u^h$ has been obtained.

In order to illuminate the key ideas while avoiding technical difficulties,
we shall consider the two-point boundary value problem

$$-(p(x)u')' + q(x)u' + r(x)u = f(x), \qquad a < x < b, \tag{14.28}$$
$$u(a) = A, \qquad u(b) = B, \tag{14.29}$$

where $p, q \in C^1[a,b]$, $r \in C[a,b]$ and $f \in L^2(a,b)$. We shall assume, as
at the beginning of the chapter, that $p(x) \ge c_0 > 0$, $x \in [a,b]$; however,
instead of supposing that $r(x) \ge 0$, we shall now demand that

$$r(x) - \tfrac{1}{2} q'(x) \ge c_1, \qquad x \in [a,b], \tag{14.30}$$

where $c_1$ is assumed to be a positive constant.¹
Letting

$$A(w, v) = \int_a^b \left[ p(x)\,w'(x)\,v'(x) + q(x)\,w'(x)\,v(x) + r(x)\,w(x)\,v(x) \right] dx,$$

the weak formulation of (14.28), (14.29) is as follows:

$$\text{find } u \in H^1_E(a,b) \text{ such that } A(u, v) = (f, v) \quad \forall v \in H^1_0(a,b). \tag{14.31}$$

Here, the bilinear functional $A(\cdot, \cdot)$ is not symmetric, unless $q(x) \equiv 0$:
indeed, $A(w,v) = A(v,w)$ for all $v, w \in H^1_0(a,b)$ if, and only if, $q \equiv 0$.
Hence, in general, the boundary value problem (14.28), (14.29) cannot
be assigned a Ritz principle. On the other hand, the Galerkin principle
(weak formulation) (14.31) is perfectly meaningful for any choice of $q$.

The Galerkin finite element approximation of (14.31) is constructed
by introducing a (possibly nonuniform) subdivision of the interval $[a,b]$
defined by the points

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b$$

and considering the finite element space $S^h_E \subset H^1_E(a,b)$ consisting of all
continuous piecewise linear functions $v^h$ on this subdivision that satisfy
the boundary conditions $v^h(a) = A$ and $v^h(b) = B$. The Galerkin finite
element approximation of the boundary value problem is

$$\text{find } u^h \in S^h_E \text{ such that } A(u^h, v^h) = (f, v^h) \quad \forall v^h \in S^h_0. \tag{14.32}$$

We let $h_i = x_i - x_{i-1}$, $i = 1, \dots, n$, and put $h = \max_i h_i$.

We wish to derive an a posteriori bound on the error in the $\|\cdot\|_{L^2(a,b)}$
norm; that is, our aim is to quantify the size of $\|u - u^h\|_{L^2(a,b)}$ in terms
of the mesh parameter $h$ and the computed solution $u^h$ (rather than
in terms of the analytical solution $u$ as was the case in the a priori
error analysis developed in the previous section). For this purpose, we
consider the auxiliary boundary value problem

$$-(p(x)z')' - (q(x)z)' + r(x)z = (u - u^h)(x), \qquad a < x < b, \tag{14.33}$$
$$z(a) = 0, \qquad z(b) = 0, \tag{14.34}$$

called the dual problem (or adjoint problem).

¹ At the expense of slight technical complications in the subsequent discussion, the
requirement that $c_1 > 0$ can be relaxed to $c_1 > -\lambda_1$, where $\lambda_1$ is the smallest
(positive) eigenvalue for the Sturm–Liouville eigenvalue problem $-(p(x)w')' = \lambda w$
for $x \in (a,b)$, $w(a) = 0$, $w(b) = 0$.

We begin our error analysis by noting that the definition of the dual
problem and straightforward integration by parts yield (recalling that
$(u - u^h)(a) = 0$, $(u - u^h)(b) = 0$)

$$\begin{aligned} \|u - u^h\|^2_{L^2(a,b)} &= (u - u^h, u - u^h) \\ &= (u - u^h, -(pz')' - (qz)' + rz) \\ &= A(u - u^h, z). \end{aligned}$$

On the other hand, (14.31) and (14.32) imply the Galerkin orthogonality
property

$$A(u - u^h, z^h) = 0 \quad \forall z^h \in S^h_0.$$

In particular, by choosing

$$z^h = \mathcal{I}^h z \in S^h_0,$$

the continuous piecewise linear interpolant of the function $z \in H^1_0(a,b)$,
associated with the subdivision $a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b$, we
have that

$$A(u - u^h, \mathcal{I}^h z) = 0.$$

Thus,

$$\begin{aligned} \|u - u^h\|^2_{L^2(a,b)} &= A(u - u^h, z - \mathcal{I}^h z) \\ &= A(u, z - \mathcal{I}^h z) - A(u^h, z - \mathcal{I}^h z) \\ &= (f, z - \mathcal{I}^h z) - A(u^h, z - \mathcal{I}^h z), \end{aligned} \tag{14.35}$$

where the last transition follows from (14.31) with $v = z - \mathcal{I}^h z$.
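We observe that the right-hand side no longer involves the unknown
analytical solution $u$. Furthermore,

$$\begin{aligned} A(u^h, z - \mathcal{I}^h z) &= \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} p(x)\,(u^h)'(x)\,(z - \mathcal{I}^h z)'(x)\,dx \\ &\quad + \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} q(x)\,(u^h)'(x)\,(z - \mathcal{I}^h z)(x)\,dx \\ &\quad + \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} r(x)\,u^h(x)\,(z - \mathcal{I}^h z)(x)\,dx. \end{aligned}$$

Integrating by parts in each of the $n$ integrals in the first sum on the
right-hand side, noting that $(z - \mathcal{I}^h z)(x_i) = 0$, $i = 0, \dots, n$, we deduce
that

$$A(u^h, z - \mathcal{I}^h z) = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[ -(p(x)(u^h)')' + q(x)(u^h)' + r(x)u^h \right] (z - \mathcal{I}^h z)(x)\,dx.$$

Furthermore,

$$(f, z - \mathcal{I}^h z) = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(x)\,(z - \mathcal{I}^h z)(x)\,dx.$$

Substituting these two identities into (14.35), we deduce that

$$\|u - u^h\|^2_{L^2(a,b)} = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} R(u^h)(x)\,(z - \mathcal{I}^h z)(x)\,dx, \tag{14.36}$$

where, for $1 \le i \le n$, and $x \in (x_{i-1}, x_i)$,

$$R(u^h)(x) = f(x) - \left[ -(p(x)(u^h)')' + q(x)(u^h)' + r(x)u^h \right].$$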


The function $R(u^h)$ is called the finite element residual; it measures
the extent to which $u^h$ fails to satisfy the differential equation

$$-(p(x)u')' + q(x)u' + r(x)u = f(x)$$

on the union of the intervals $(x_{i-1}, x_i)$, $i = 1, \dots, n$. Now, applying the
Cauchy–Schwarz inequality on the right-hand side of (14.36) yields

$$\|u - u^h\|^2_{L^2(a,b)} \le \sum_{i=1}^{n} \|R(u^h)\|_{L^2(x_{i-1}, x_i)}\,\|z - \mathcal{I}^h z\|_{L^2(x_{i-1}, x_i)}.$$

Recalling from Theorem 14.7 that

$$\|z - \mathcal{I}^h z\|_{L^2(x_{i-1}, x_i)} \le \left(\frac{h_i}{\pi}\right)^2 \|z''\|_{L^2(x_{i-1}, x_i)}, \qquad i = 1, 2, \dots, n,$$

we deduce that

$$\|u - u^h\|^2_{L^2(a,b)} \le \frac{1}{\pi^2} \sum_{i=1}^{n} h_i^2\,\|R(u^h)\|_{L^2(x_{i-1}, x_i)}\,\|z''\|_{L^2(x_{i-1}, x_i)},$$

and consequently, using the Cauchy–Schwarz inequality for finite sums,

$$\sum_{i=1}^{n} a_i b_i \le \left( \sum_{i=1}^{n} a_i^2 \right)^{1/2} \left( \sum_{i=1}^{n} b_i^2 \right)^{1/2}$$

with

$$a_i = h_i^2\,\|R(u^h)\|_{L^2(x_{i-1}, x_i)} \quad \text{and} \quad b_i = \|z''\|_{L^2(x_{i-1}, x_i)},$$

we find that

$$\|u - u^h\|^2_{L^2(a,b)} \le \frac{1}{\pi^2} \left( \sum_{i=1}^{n} h_i^4\,\|R(u^h)\|^2_{L^2(x_{i-1}, x_i)} \right)^{1/2} \|z''\|_{L^2(a,b)}. \tag{14.37}$$

The rest of the discussion is aimed at eliminating $\|z''\|_{L^2(a,b)}$ from the
right-hand side of (14.37). The desired a posteriori bound on the error
$\|u - u^h\|_{L^2(a,b)}$ in terms of $R(u^h)$ will then follow.


Lemma 14.1 Suppose that $z$ is the solution of the dual problem (14.33),
(14.34). Then, there exists a positive constant $K$, dependent only on $p$,
$q$ and $r$, such that

$$\|z''\|_{L^2(a,b)} \le K\,\|u - u^h\|_{L^2(a,b)}.$$


Proof As

$$-p z'' = (u - u^h) + (p' + q) z' - (r - q') z,$$

it follows that

$$p\,|z''| \le |u - u^h| + |p' + q|\,|z'| + |r - q'|\,|z|,$$

and therefore, recalling that $p(x) \ge c_0 > 0$ for $x \in [a,b]$,

$$c_0\,\|z''\|_{L^2(a,b)} \le \|u - u^h\|_{L^2(a,b)} + \|p' + q\|_\infty\,\|z'\|_{L^2(a,b)} + \|r - q'\|_\infty\,\|z\|_{L^2(a,b)}, \tag{14.38}$$

where we used the notation $\|w\|_\infty = \max_{x \in [a,b]} |w(x)|$.

We shall show that both $\|z'\|_{L^2(a,b)}$ and $\|z\|_{L^2(a,b)}$ can be bounded in
terms of $\|u - u^h\|_{L^2(a,b)}$ and then, using (14.38), we shall deduce that
the same is true of $\|z''\|_{L^2(a,b)}$. Let us observe that, by (14.33),

$$(-(pz')' - (qz)' + rz,\; z) = (u - u^h, z). \tag{14.39}$$

Integrating by parts in the terms involving $p$ and $q$ and noting that
$z(a) = 0$ and $z(b) = 0$ yields

$$\begin{aligned} (-(pz')' - (qz)' + rz,\; z) &= (pz', z') + (qz, z') + (rz, z) \\ &\ge c_0\,\|z'\|^2_{L^2(a,b)} + \frac{1}{2} \int_a^b q(x)\,[z^2(x)]'\,dx + \int_a^b r(x)\,[z(x)]^2\,dx. \end{aligned}$$

Integrating by parts, again, in the second term on the right gives

$$(pz', z') + (qz, z') + (rz, z) \ge c_0\,\|z'\|^2_{L^2(a,b)} - \frac{1}{2} \int_a^b q'(x)\,[z(x)]^2\,dx + \int_a^b r(x)\,[z(x)]^2\,dx.$$

Hence, from (14.39),

$$c_0\,\|z'\|^2_{L^2(a,b)} + \int_a^b \left( r(x) - \tfrac{1}{2} q'(x) \right) [z(x)]^2\,dx \le (u - u^h, z),$$

and thereby, noting (14.30) and using the Cauchy–Schwarz inequality
on the right-hand side,

$$\min\{c_0, c_1\} \left( \|z'\|^2_{L^2(a,b)} + \|z\|^2_{L^2(a,b)} \right) \le (u - u^h, z) \le \|u - u^h\|_{L^2(a,b)}\,\|z\|_{L^2(a,b)}. \tag{14.40}$$

Therefore, also

$$\min\{c_0, c_1\}\,\|z\|^2_{H^1(a,b)} \le \|u - u^h\|_{L^2(a,b)}\,\|z\|_{H^1(a,b)},$$

which means that

$$\|z\|_{H^1(a,b)} = \left( \|z'\|^2_{L^2(a,b)} + \|z\|^2_{L^2(a,b)} \right)^{1/2} \le \frac{1}{\min\{c_0, c_1\}}\,\|u - u^h\|_{L^2(a,b)}. \tag{14.41}$$

Now we substitute (14.41) into (14.38) to deduce that

$$\|z''\|_{L^2(a,b)} \le K\,\|u - u^h\|_{L^2(a,b)}, \tag{14.42}$$

where

$$K = \frac{1}{c_0} \left( 1 + \frac{1}{\min\{c_0, c_1\}} \left( \|p' + q\|_\infty^2 + \|r - q'\|_\infty^2 \right)^{1/2} \right).$$

It is important to observe here that $K$ involves only known quantities:
the coefficients in the differential equation under consideration. Therefore
$K$ can be computed, or at least bounded above, without difficulties.
On inserting (14.42) into (14.37), we arrive at our final result, the
computable a posteriori error bound,

$$\|u - u^h\|_{L^2(a,b)} \le K_0 \left( \sum_{i=1}^{n} h_i^4\,\|R(u^h)\|^2_{L^2(x_{i-1}, x_i)} \right)^{1/2}, \tag{14.43}$$

where $K_0 = K/\pi^2$.
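Since $K_0$ depends only on the coefficients, it can be evaluated before any finite element computation is carried out. The sketch below uses only the Python standard library and assumes the formula $K = c_0^{-1}\bigl(1 + (\|p'+q\|_\infty^2 + \|r-q'\|_\infty^2)^{1/2}/\min\{c_0, c_1\}\bigr)$ together with $K_0 = K/\pi^2$; the function name and argument names are hypothetical. The sample values correspond to the constant coefficients $p \equiv 1$, $q \equiv 20$, $r \equiv 10$ of Example 14.1, for which $c_0 = 1$ and $c_1 = 10$.

```python
import math

def a_posteriori_constants(c0, c1, norm_p_prime_plus_q, norm_r_minus_q_prime):
    """Return (K, K0) from the lower bounds c0, c1 and the two maximum norms."""
    combined = math.hypot(norm_p_prime_plus_q, norm_r_minus_q_prime)
    K = (1.0 + combined / min(c0, c1)) / c0
    return K, K / math.pi ** 2

# p = 1, q = 20, r = 10: ||p' + q|| = 20, ||r - q'|| = 10, c0 = 1, c1 = 10
K, K0 = a_posteriori_constants(
    c0=1.0, c1=10.0, norm_p_prime_plus_q=20.0, norm_r_minus_q_prime=10.0
)
```

For these data $K = 1 + \sqrt{500}$, so that $K_0 = (1 + \sqrt{500})/\pi^2$, in agreement with the value quoted in Example 14.1.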

Next we shall describe the construction of an adaptive mesh refinement
algorithm based on the a posteriori error bound (14.43).

Suppose that TOL is a prescribed tolerance and that our aim is to
compute a finite element approximation $u^h$ to the unknown solution $u$
so that

$$\|u - u^h\|_{L^2(a,b)} \le \mathrm{TOL}. \tag{14.44}$$

We shall use the a posteriori error bound (14.43) to achieve this goal
by systematically refining the subdivision, and computing a succession
of numerical solutions $u^h$ on this sequence of subdivisions, until the
inequality

$$K_0 \left( \sum_{i=1}^{n} h_i^4\,\|R(u^h)\|^2_{L^2(x_{i-1}, x_i)} \right)^{1/2} \le \mathrm{TOL} \tag{14.45}$$

is satisfied. Clearly, if $u^h$ satisfies (14.45), then, by virtue of (14.43), it
also satisfies (14.44).


In order for the inequality (14.45) to hold it is sufficient to ensure
that, on each interval $[x_{i-1}, x_i]$, $i = 1, 2, \dots, n$, we have

$$h_i^4\,\|R(u^h)\|^2_{L^2(x_{i-1}, x_i)} \le \frac{1}{n} \left( \frac{\mathrm{TOL}}{K_0} \right)^2. \tag{14.46}$$

Thus, a sufficient condition for (14.44) is that (14.46) holds for all
$i = 1, 2, \dots, n$.
The mesh adaptation algorithm, therefore, proceeds as follows:


Step 1. Choose an initial subdivision
        $$\mathcal{T}_0 :\ a = x_0^{(0)} < x_1^{(0)} < \cdots < x_{n_0 - 1}^{(0)} < x_{n_0}^{(0)} = b$$
        of the interval $[a,b]$, with $h_i^{(0)} = x_i^{(0)} - x_{i-1}^{(0)}$ for $i = 1, 2, \dots, n_0$;
        let $h^{(0)} = \max_i h_i^{(0)}$, and consider the associated finite element
        space $S_E^{h^{(0)}}$ (of dimension $n_0 - 1$);
Step 2. Compute the corresponding solution $u^{h^{(0)}} \in S_E^{h^{(0)}}$;
Step 3. Given a computed solution $u^{h^{(m)}} \in S_E^{h^{(m)}}$ for some $m \ge 0$,
        defined on a subdivision $\mathcal{T}_m$, STOP if
        $$K_0 \left( \sum_{i=1}^{n_m} \bigl(h_i^{(m)}\bigr)^4\,\|R(u^{h^{(m)}})\|^2_{L^2(x_{i-1}^{(m)}, x_i^{(m)})} \right)^{1/2} \le \mathrm{TOL}; \tag{14.47}$$
Step 4. If not, then halve those elements $[x_{i-1}^{(m)}, x_i^{(m)}]$ in $\mathcal{T}_m$, with $i$
        in the set $\{1, 2, \dots, n_m\}$, for which
        $$\bigl(h_i^{(m)}\bigr)^4\,\|R(u^{h^{(m)}})\|^2_{L^2(x_{i-1}^{(m)}, x_i^{(m)})} > \frac{1}{n_m} \left( \frac{\mathrm{TOL}}{K_0} \right)^2; \tag{14.48}$$
        denote by $\mathcal{T}_{m+1}$ the resulting subdivision of $[a,b]$ with $n_{m+1}$
        elements $[x_{i-1}^{(m+1)}, x_i^{(m+1)}]$ of respective lengths
        $h_i^{(m+1)} = x_i^{(m+1)} - x_{i-1}^{(m+1)}$, $i = 1, \dots, n_{m+1}$; let
        $h^{(m+1)} = \max_i h_i^{(m+1)}$, and consider the associated finite element
        space $S_E^{h^{(m+1)}}$ of dimension $n_{m+1} - 1$;
Step 5. Compute the finite element approximation $u^{h^{(m+1)}} \in S_E^{h^{(m+1)}}$;
        increase $m$ by 1 and return to Step 3.

The inequality (14.47) is called the stopping criterion for the mesh
adaptation algorithm, and (14.48) is referred to as the refinement
criterion. According to the a posteriori error bound (14.43), when the
adaptive algorithm terminates, the error $\|u - u^h\|_{L^2(a,b)}$ is guaranteed
not to exceed the prescribed tolerance TOL.
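Steps 1 to 5 can be sketched compactly for a problem with constant coefficients. The code below assumes NumPy and the data $p \equiv 1$, $q \equiv 20$, $r \equiv 10$, $f \equiv 1$ on $[0,1]$ with homogeneous boundary conditions, for which $K_0 = (1 + \sqrt{500})/\pi^2$; the function names, the dense solver, the cap on the number of refinement steps and the tolerance value are illustrative choices, not part of the algorithm as stated. With constant coefficients the residual is linear on each element, so its $L^2$ norm is integrated exactly.

```python
import numpy as np

P, Q, R_COEF, F = 1.0, 20.0, 10.0, 1.0        # constant coefficients (assumed data)
K0 = (1.0 + np.sqrt(500.0)) / np.pi ** 2      # constant in (14.43) for these data

def solve_fe(x):
    """Galerkin solution of -(P u')' + Q u' + R u = F, u(0) = u(1) = 0, on the mesh x."""
    n = len(x) - 1
    M = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for e in range(n):                         # element-by-element assembly
        h = x[e + 1] - x[e]
        local = ((P / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
                 + (Q / 2.0) * np.array([[-1.0, 1.0], [-1.0, 1.0]])
                 + (R_COEF * h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]]))
        M[e:e + 2, e:e + 2] += local
        b[e:e + 2] += F * h / 2.0
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(M[1:-1, 1:-1], b[1:-1])   # boundary values are zero
    return u

def indicators(x, u):
    """h_i^4 ||R(u^h)||^2 on each element; R = F - Q (u^h)' - R u^h is linear there."""
    h = np.diff(x)
    slope = np.diff(u) / h
    r_left = F - Q * slope - R_COEF * u[:-1]
    r_right = F - Q * slope - R_COEF * u[1:]
    return h ** 4 * h * (r_left ** 2 + r_left * r_right + r_right ** 2) / 3.0

def adapt(tol, n0=10, max_steps=30):
    """Refine until the stopping criterion (14.47) holds or max_steps is reached."""
    x = np.linspace(0.0, 1.0, n0 + 1)
    steps = 0
    while True:
        u = solve_fe(x)
        eta = indicators(x, u)
        bound = K0 * np.sqrt(eta.sum())
        if bound <= tol or steps == max_steps:
            return x, u, bound
        flagged = eta > (tol / K0) ** 2 / len(eta)      # refinement criterion (14.48)
        x = np.sort(np.concatenate([x, 0.5 * (x[:-1] + x[1:])[flagged]]))
        steps += 1

mesh, u_h, bound = adapt(tol=1e-2)             # illustrative tolerance
```

Whenever the stopping test fails, at least one element violates (14.46), so every pass through the loop refines the mesh. The refinement concentrates near $x = 1$, where the solution of this problem has a boundary layer.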

We conclude the body of this chapter with a numerical experiment 
which illustrates the performance of the adaptive algorithm. 


Example 14.1 Let us consider the second-order ordinary differential
equation

$$-(p(x)u')' + q(x)u' + r(x)u = f(x), \qquad x \in (0,1), \tag{14.49}$$

subject to the boundary conditions

$$u(0) = 0, \qquad u(1) = 0. \tag{14.50}$$

Suppose, for example, that

$$p(x) \equiv 1, \quad q(x) \equiv 20, \quad r(x) \equiv 10 \quad \text{and} \quad f(x) \equiv 1.$$

In this case, the analytical solution, $u$, can be expressed in closed form:

$$u(x) = C_1 e^{\lambda_1 x} + C_2 e^{\lambda_2 x} + \frac{1}{10},$$

where $\lambda_1$ and $\lambda_2$ are the two roots of the characteristic polynomial of
the differential equation, $-\lambda^2 + 20\lambda + 10 = 0$, i.e.,

$$\lambda_1 = 10 + \sqrt{110}, \qquad \lambda_2 = 10 - \sqrt{110},$$
Fig. 14.4. Analytical solution of the boundary value problem (14.49), (14.50),
with $p(x) \equiv 1$, $q(x) \equiv 20$, $r(x) \equiv 10$ and $f(x) \equiv 1$.


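and $C_1$ and $C_2$ are constants chosen so as to ensure that $u(0) = 0$ and
$u(1) = 0$; hence,

$$C_1 = \frac{e^{\lambda_2} - 1}{10\,(e^{\lambda_1} - e^{\lambda_2})}, \qquad C_2 = \frac{1 - e^{\lambda_1}}{10\,(e^{\lambda_1} - e^{\lambda_2})}.$$

The function $u$ is shown in Figure 14.4.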
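The closed-form solution can be evaluated directly, which is convenient for checking the boundary conditions and for later comparison with computed approximations. This is a small sketch assuming NumPy; the name `u_exact` and the sample grid are illustrative.

```python
import numpy as np

lam1 = 10.0 + np.sqrt(110.0)                  # roots of -lam^2 + 20 lam + 10 = 0
lam2 = 10.0 - np.sqrt(110.0)
denom = 10.0 * (np.exp(lam1) - np.exp(lam2))
C1 = (np.exp(lam2) - 1.0) / denom
C2 = (1.0 - np.exp(lam1)) / denom

def u_exact(x):
    """Analytical solution of -u'' + 20 u' + 10 u = 1, u(0) = u(1) = 0."""
    return C1 * np.exp(lam1 * x) + C2 * np.exp(lam2 * x) + 0.1

xs = np.linspace(0.0, 1.0, 1001)
u_max = u_exact(xs).max()                     # peak value of the solution on [0, 1]
char_poly = [-lam ** 2 + 20.0 * lam + 10.0 for lam in (lam1, lam2)]
```

Because both exponentials are annihilated by the characteristic polynomial and the constant $1/10$ balances the right-hand side, the function satisfies the differential equation exactly; the steep decay to zero near $x = 1$ is the boundary layer visible in Figure 14.4.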
Now, let us imagine for a moment that $u$ is unknown, and let us
compute a numerical approximation $u^h$ to $u$, using the adaptive finite
element algorithm described above, so that $\|u - u^h\|_{L^2(0,1)} \le$ TOL, where
TOL = 10~*. The computation begins on a coarse subdivision of the in- 
terval [0, 1] containing only 10 elements. This is then successively refined 
using the refinement criterion (14.48) until the stopping criterion (14.47) 
is satisfied; the resulting subdivisions are shown in Figure 14.5. In this
example, the constant $K_0$ appearing in (14.43) and (14.45)–(14.48) is
$(1 + \sqrt{500})/\pi^2$ ($\approx 2.367$).

Since we are in the fortunate (but highly idealised) position that, in
addition to the numerical solution $u^h$, the analytical solution $u$ is also
available, we can assess the sharpness of our a posteriori error bound
(14.43) by comparing the error $\|u - u^h\|_{L^2(0,1)}$ appearing on the
left-hand side of (14.43) with the computable a posteriori error bound on
the right-hand side of (14.43). Figure 14.6 shows that the a posteriori
bound consistently overestimates the error $\|u - u^h\|_{L^2(0,1)}$ by about two
orders of magnitude. By comparing the slopes of the two curves in Figure
14.6, we also see that the error and the a posteriori error bound decay
at approximately the same rate as the number of mesh points increases
in the course of mesh adaptation. ◇
Subdivision 1 with 10 elements


Subdivision 2 with 12 elements 


Subdivision 3 with 24 elements 


Subdivision 4 with 34 elements 


Subdivision 5 with 64 elements 


Subdivision 6 with 86 elements 


Fig. 14.5. Sequence of subdivisions of the interval [0,1] designed by the adap- 
tive algorithm with TOL = 107‘. 


14.6 Notes 


For further details concerning the mathematical theory and the imple- 
mentation of the finite element method we refer to the following books. 


• D. Braess, Finite Elements, Cambridge University Press, Cambridge,
  2001.

• S. Brenner and L.R. Scott, The Mathematical Theory of Finite
  Element Methods, Second Edition, Springer, New York, 2002.

• C. Johnson, Numerical Solution of Partial Differential Equations by
  the Finite Element Method, Cambridge University Press, Cambridge,
  1996.


For recent results on the theory of a posteriori error estimation for finite 
element approximations of differential equations, based on duality argu- 
ments, the interested reader may wish to consult the following review 
articles. 
[Figure 14.6: plot of $\|u - u^h\|_{L^2(0,1)}$ (vertical axis) against the number of
mesh points (horizontal axis); solid curve: true error, dashed curve: error bound.]


Fig. 14.6. Comparison of the true error ||u — u"||,2(0,1) (solid curve) with the 
a posteriori error bound delivered by the adaptive algorithm (dashed curve) 
with TOL = 107*. 


• K. Eriksson, D. Estep, P. Hansbo, and C. Johnson, Introduction
  to adaptive methods for differential equations, in Acta Numerica 4
  (A. Iserles, ed.), Cambridge University Press, Cambridge, 105–158,
  1995.

• R. Becker and R. Rannacher, An optimal control approach to
  a-posteriori error estimation in finite element methods, in Acta
  Numerica 10 (A. Iserles, ed.), Cambridge University Press, Cambridge,
  1–102, 2001.

• M.B. Giles and E. Süli, Adjoint methods for PDEs: superconvergence
  and adaptivity by duality, in Acta Numerica 11 (A. Iserles, ed.),
  Cambridge University Press, Cambridge, 145–236, 2002.


A detailed and general survey of the subject of a posteriori error esti- 
mation can be found in 


• M. Ainsworth and J.T. Oden, A Posteriori Error Estimation in
  Finite Element Analysis, John Wiley & Sons, New York, 2000.
In this chapter we were concerned with the a priori error analysis of
the piecewise linear finite element method in the energy norm, and its a
posteriori error analysis in the $L^2$ norm. Using similar techniques, one
can establish an a priori error bound in the $L^2$ norm and an a posteriori
error bound in the energy norm. For extensions of the theory considered
here to higher-order piecewise polynomial finite element approximations
and generalisations to partial differential equations, the reader is referred
to the books listed above.


Exercises 


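14.1 Given that $(a,b)$ is an open interval of the real line, let

$$H^1_{E_0}(a,b) = \{ v \in H^1(a,b) : v(a) = 0 \}.$$

(i) By writing

$$v(x) = \int_a^x v'(\xi)\,d\xi,$$

for $v \in H^1_{E_0}(a,b)$ and $x \in [a,b]$, show the following (Poincaré–
Friedrichs) inequality:

$$\|v\|^2_{L^2(a,b)} \le \tfrac{1}{2}(b - a)^2\,\|v'\|^2_{L^2(a,b)} \quad \forall v \in H^1_{E_0}(a,b).$$

(ii) By writing

$$[v(x)]^2 = \int_a^x \frac{d}{d\xi}[v(\xi)]^2\,d\xi = 2 \int_a^x v(\xi)\,v'(\xi)\,d\xi$$

for $v \in H^1_{E_0}(a,b)$ and $x \in [a,b]$, show the following (Agmon’s)
inequality:

$$\max_{x \in [a,b]} |v(x)|^2 \le 2\,\|v\|_{L^2(a,b)}\,\|v'\|_{L^2(a,b)} \quad \forall v \in H^1_{E_0}(a,b).$$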


14.2 Given that $f \in L^2(0,1)$, state the weak formulation of each of
the following boundary value problems on the interval $(0,1)$:
(a) $-u'' + u = f(x)$, $u(0) = 0$, $u(1) = 0$;
(b) $-u'' + u = f(x)$, $u(0) = 0$, $u'(1) = 1$;
(c) $-u'' + u = f(x)$, $u(0) = 0$, $u(1) + u'(1) = 2$.
In each case, show that there exists at most one weak solution.
14.3 Give a proof of Theorem 14.4.
14.4 Prove Corollary 14.2.


14.5 Consider the boundary value problem
\[ -p_0 u'' + r_0 u = f(x), \qquad u(0) = 0, \quad u(1) = 0, \]
on the interval $[0,1]$, where $p_0$ and $r_0$ are positive constants and
$f \in C^4[0,1]$. Using equally spaced points
\[ x_i = ih, \quad i = 0, 1, \ldots, n, \quad \text{with } h = 1/n, \ n \geq 2, \]
and the standard piecewise linear finite element basis functions
(hat functions) $\varphi_i$, $i = 1, 2, \ldots, n-1$, show that the finite
element equations for $u_i = u^h(x_i)$ become
\[ -p_0\,\frac{u_{i-1} - 2u_i + u_{i+1}}{h^2} + r_0\,\frac{u_{i-1} + 4u_i + u_{i+1}}{6} = \frac{1}{h}\langle f, \varphi_i \rangle \]
for $i = 1, 2, \ldots, n-1$, with $u_0 = 0$ and $u_n = 0$. By expanding
in Taylor series, show that
\[ \frac{1}{h}\langle f, \varphi_i \rangle = f(x_i) + \tfrac{1}{12} h^2 f''(x_i) + \mathcal{O}(h^4). \]
Interpreting this set of difference equations as a finite difference
approximation to the boundary value problem, as in Chapter
13, show that the corresponding truncation error $T_i$ satisfies
\[ T_i = \tfrac{1}{12} h^2 r_0 u''(x_i) + \mathcal{O}(h^4), \qquad i = 1, \ldots, n-1, \]
and use the method of Exercise 13.2 to show that
\[ \max_{0 \leq i \leq n} |u(x_i) - u^h(x_i)| \leq M h^2, \]
where $M$ is a positive constant.

14.6 In the notation of Exercise 14.5, suppose that all the integrals
involved in the calculation are approximated by the trapezium
rule. Show that the system of equations becomes identical
to that obtained from the central difference approximation in
Chapter 13, and deduce that
\[ \max_{0 \leq i \leq n} |u(x_i) - u^h(x_i)| \leq M h^2, \]
where $M$ is a positive constant.

14.7 Consider the differential equation
\[ -(p(x)u')' + r(x)u = f(x), \qquad a < x < b, \]
with $p$, $r$ and $f$ as at the beginning of the chapter, subject to
the boundary conditions
\[ -p(a)u'(a) + \alpha u(a) = A, \qquad p(b)u'(b) + \beta u(b) = B, \]


where $\alpha$ and $\beta$ are positive real numbers, and $A$ and $B$ are
real numbers. Show that the weak formulation of the boundary
value problem is

find $u \in H^1(a,b)$ such that $\mathcal{A}(u,v) = \ell(v)$ for all $v \in H^1(a,b)$,

where
\[ \mathcal{A}(u,v) = \int_a^b \left[ p(x)u'(x)v'(x) + r(x)u(x)v(x) \right] \mathrm{d}x + \alpha u(a)v(a) + \beta u(b)v(b), \]
and
\[ \ell(v) = \langle f, v \rangle + A v(a) + B v(b). \]


Construct a finite element approximation of the boundary value
problem based on this weak formulation using piecewise linear
finite element basis functions on the subdivision
\[ a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b \]
of the interval $[a,b]$. Show that the finite element method gives
rise to a set of $n+1$ simultaneous linear equations with $n+1$
unknowns $u_i = u^h(x_i)$, $i = 0, 1, \ldots, n$. Show that this linear
system has a unique solution.

Comment on the structure of the matrix $M \in \mathbb{R}^{(n+1)\times(n+1)}$
of the linear system: (a) Is $M$ symmetric? (b) Is $M$ positive
definite? (c) Is $M$ tridiagonal?
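
The structural questions at the end of this exercise can be explored
numerically before they are answered by argument. The sketch below is a
simplified illustration, not the general construction: it assumes
constant coefficients `p` and `r`, a uniform subdivision, and
hypothetical default values for `alpha` and `beta`; the function name
`robin_fem_matrix` is invented for this example.

```python
import numpy as np

def robin_fem_matrix(n, a=0.0, b=1.0, p=1.0, r=1.0, alpha=1.0, beta=1.0):
    """Assemble the (n+1) x (n+1) piecewise linear finite element matrix
    for constant p, r on a uniform subdivision of [a, b]."""
    h = (b - a) / n
    M = np.zeros((n + 1, n + 1))
    # Element contributions of the hat functions on one element of length h.
    k_el = (p / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])      # int p phi_i' phi_j'
    m_el = (r * h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])  # int r phi_i phi_j
    for e in range(n):
        M[e:e + 2, e:e + 2] += k_el + m_el
    # Boundary terms alpha*u(a)*v(a) and beta*u(b)*v(b) of the bilinear form.
    M[0, 0] += alpha
    M[n, n] += beta
    return M

M = robin_fem_matrix(8)
is_symmetric = np.allclose(M, M.T)
is_tridiagonal = np.allclose(M, np.triu(np.tril(M, 1), -1))
min_eig = np.linalg.eigvalsh(M).min()

print("symmetric:  ", is_symmetric)
print("tridiagonal:", is_tridiagonal)
print("smallest eigenvalue:", min_eig)
```

A positive smallest eigenvalue for one value of $n$ is evidence, not a
proof; the exercise asks for an argument based on the bilinear form.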

14.8 Given that $\alpha$ is a nonnegative real number, consider the
differential equation
\[ -u'' + u = f(x) \quad \text{for } x \in (0,1), \]
subject to the boundary conditions
\[ u(0) = 0, \qquad \alpha u(1) + u'(1) = 0. \]
State the weak formulation of the problem. Using continuous
piecewise linear basis functions on a uniform subdivision of
$[0,1]$ into elements of size $h = 1/n$, $n \geq 2$, write down the
finite element approximation to this problem and show that this
has a unique solution $u^h$. Expand $u^h$ in terms of the standard
piecewise linear finite element basis functions (hat functions) $\varphi_i$,
$i = 1, 2, \ldots, n$, by writing
\[ u^h(x) = \sum_{i=1}^{n} U_i \varphi_i(x) \]
to obtain a system of linear equations for the vector of unknowns
$(U_1, \ldots, U_n)^{\mathrm{T}}$.

Suppose that $\alpha = 0$, $f(x) = 1$ and $h = 1/3$. Solve the resulting
system of linear equations and compare the corresponding
numerical solution $u^h(x)$ with the exact solution $u(x)$ of the
boundary value problem.
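
The last part of this exercise is small enough to do by hand, but a short
script is a convenient way to check the arithmetic. The sketch below
assumes the weak form $\int_0^1 (u'v' + uv)\,\mathrm{d}x + \alpha u(1)v(1) = \int_0^1 fv\,\mathrm{d}x$
with $f = 1$, assembles the hat-function system on $n$ equal elements,
and compares the nodal values with $u(x) = 1 - \cosh(1-x)/\cosh 1$, which
is the exact solution for $\alpha = 0$. The function name
`solve_hat_system` is an invention for this example.

```python
import numpy as np

def solve_hat_system(n=3, alpha=0.0):
    """Solve the piecewise linear finite element system for
    -u'' + u = 1, u(0) = 0, alpha*u(1) + u'(1) = 0 on n equal elements."""
    h = 1.0 / n
    A = np.zeros((n + 1, n + 1))
    F = np.zeros(n + 1)
    k_el = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])  # int phi_i' phi_j'
    m_el = (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])    # int phi_i phi_j
    f_el = (h / 2.0) * np.ones(2)                            # int 1 * phi_i
    for e in range(n):
        A[e:e + 2, e:e + 2] += k_el + m_el
        F[e:e + 2] += f_el
    A[n, n] += alpha              # boundary term alpha*u(1)*v(1)
    # Drop the row and column of node x_0, where u(0) = 0 is imposed.
    return np.linalg.solve(A[1:, 1:], F[1:])

n = 3
nodes = np.linspace(0.0, 1.0, n + 1)[1:]            # x_1, ..., x_n
U = solve_hat_system(n=n, alpha=0.0)
exact = 1.0 - np.cosh(1.0 - nodes) / np.cosh(1.0)   # exact solution for alpha = 0
max_err = np.max(np.abs(U - exact))

for xi, Ui, ui in zip(nodes, U, exact):
    print(f"x = {xi:.4f}   U = {Ui:.6f}   u = {ui:.6f}")
print("largest nodal error:", max_err)
```

Increasing `n` in this sketch gives a quick impression of how the nodal
error decreases as the subdivision is refined.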

14.9 Consider the differential equation
\[ -(p(x)u')' + r(x)u = f(x), \qquad x \in (0,1), \]
subject to the boundary conditions $u(0) = 0$, $u(1) = 0$, where
$p(x) \geq c_0 > 0$, $r(x) \geq 0$ for all $x$ in the closed interval $[0,1]$, with
$p \in C^1[0,1]$, $r \in C[0,1]$ and $f \in L^2(0,1)$. Given that $u^h$ denotes
the continuous piecewise linear finite element approximation to
$u$ on a uniform subdivision of $[0,1]$ into elements of size $h = 1/n$,
$n \geq 2$, show that


\[ \|u - u^h\|_{H^1(0,1)} \leq C_1 h\,\|u''\|_{L^2(0,1)}, \]

where $C_1$ is a positive constant that you should specify. Show
further that there exists a positive constant $C_2$ such that

\[ \|u - u^h\|_{H^1(0,1)} \leq C_2 h\,\|f\|_{L^2(0,1)}. \]


Calculate the right-hand sides in these inequalities in the case
when
\[ p(x) = 1, \quad r(x) = 0, \quad f(x) = 1, \]
for $x \in [0,1]$, and $h = 10^{-3}$.
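
As a numerical companion to this exercise, the sketch below takes the
model data $p = 1$, $r = 0$, $f = 1$, for which $u(x) = \tfrac{1}{2}x(1-x)$,
and measures the actual finite element error so that it can be set
against the computed bounds. It assumes that the error is measured in
the $H^1(0,1)$ norm $(\|v\|^2_{L^2} + \|v'\|^2_{L^2})^{1/2}$; the function
name `h1_error_model_problem` and the choice of a three-point Gauss rule
are assumptions made for this illustration.

```python
import numpy as np

def h1_error_model_problem(n):
    """Finite element error for -u'' = 1, u(0) = u(1) = 0 on n equal elements.
    Returns (h, H1 norm of the error, L2 norm of the error derivative)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    # Stiffness matrix and load vector for the interior hat functions.
    K = (2.0 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)) / h
    F = h * np.ones(n - 1)
    U = np.zeros(n + 1)
    U[1:-1] = np.linalg.solve(K, F)

    u = lambda t: 0.5 * t * (1.0 - t)   # exact solution
    du = lambda t: 0.5 - t              # its derivative

    # Three-point Gauss-Legendre rule, exact for polynomials of degree <= 5.
    nodes, weights = np.polynomial.legendre.leggauss(3)
    l2_sq = 0.0
    semi_sq = 0.0
    for i in range(1, n + 1):
        mid, half = 0.5 * (x[i - 1] + x[i]), 0.5 * h
        t = mid + half * nodes
        slope = (U[i] - U[i - 1]) / h
        uh = U[i - 1] + slope * (t - x[i - 1])
        l2_sq += half * np.sum(weights * (u(t) - uh) ** 2)
        semi_sq += half * np.sum(weights * (du(t) - slope) ** 2)
    return h, np.sqrt(l2_sq + semi_sq), np.sqrt(semi_sq)

for n in (10, 100):
    h, h1_err, semi_err = h1_error_model_problem(n)
    print(f"n = {n:4d}   H1 error / h = {h1_err / h:.6f}   "
          f"derivative error / h = {semi_err / h:.6f}")
```

For this particular problem the finite element solution agrees with $u$
at the nodes, and a short hand calculation then gives
$\|(u - u^h)'\|_{L^2(0,1)} = h/\sqrt{12}$, so the printed ratios should
settle near $1/\sqrt{12} \approx 0.2887$.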
14.10 Consider the two-point boundary value problem
\[ -u'' + u = f(x), \quad x \in (0,1), \qquad u(0) = 0, \quad u(1) = 0, \]
with $f \in C^2[0,1]$. State the piecewise linear finite element
approximation to this problem on a nonuniform subdivision
\[ 0 = x_0 < x_1 < \cdots < x_n = 1, \qquad n \geq 2, \]
with $h_i = x_i - x_{i-1}$, assuming that, for a continuous piecewise
linear function $v^h$,
\[ \int_0^1 f(x)\,v^h(x)\,\mathrm{d}x \]
has been approximated by applying the trapezium rule on each
element $[x_{i-1}, x_i]$.

Verify that the following a posteriori bound holds for the error
between $u$ and its finite element approximation $u^h$:
\[ \|u - u^h\|_{L^2(0,1)} \leq K_0 \left( \sum_{i=1}^{n} h_i^4\,\|R(u^h)\|^2_{L^2(x_{i-1},x_i)} \right)^{1/2}
   + K_1 \max_{1 \leq i \leq n} h_i^2 \left( \max_{x \in [x_{i-1},x_i]} |f''(x)|^2 + 4 \max_{x \in [x_{i-1},x_i]} |f'(x)|^2 \right)^{1/2}, \]
where $R(u^h) = f(x) - (-(u^h)''(x) + u^h(x))$ for $x \in (x_{i-1}, x_i)$,
$i = 1, \ldots, n$, and $K_0$, $K_1$ are constants which you should specify.

How would you use this bound to compute $u$ to within a
specified tolerance TOL?


Appendix A 


An overview of results from real analysis 


In this Appendix we gather a number of results from real analysis which 
are assumed at various places in the text. Some of these will be familiar 
from any course on the subject, and no proofs are given; a small number 
may be less familiar, and we give proofs of these for completeness. 


Theorem A.1 (The Intermediate Value Theorem) Suppose that $f$
is a real-valued function, defined and continuous on the closed interval
$[a,b]$ of $\mathbb{R}$. Then, $f$ is a bounded function on the interval $[a,b]$ and, if $y$
is any number such that
\[ \inf_{x \in [a,b]} f(x) \leq y \leq \sup_{x \in [a,b]} f(x), \]
then there is a number $\xi \in [a,b]$ such that $f(\xi) = y$. In particular, the
infimum and the supremum of $f$ are achieved, and can be replaced by
$\min_{x \in [a,b]}$ and $\max_{x \in [a,b]}$, respectively.


The next result, known as Rolle’s Theorem, was published in an obscure
book in 1691 by the French mathematician Michel Rolle (1652-1719),
who invented the notation $\sqrt[n]{x}$ for the $n$th root of $x$.


Theorem A.2 (Rolle’s Theorem) Suppose that $f$ is a real-valued
function, defined and continuous on the closed interval $[a,b]$ of $\mathbb{R}$,
differentiable in the open interval $(a,b)$, and such that $f(a) = f(b)$. Then,
there exists a number $\xi \in (a,b)$ such that $f'(\xi) = 0$.


It is often important in our applications that the point $\xi$ lies in the
open interval $(a,b)$, i.e., $a < \xi < b$. For instance it may happen that
$f'(a) = f'(b) = 0$, as well as $f(a) = f(b)$; Theorem A.2 then states that,
in addition to the endpoints of the interval $[a,b]$, there is also an
interior point $\xi \in (a,b)$ at which the derivative vanishes.


Theorem A.3 (The Mean Value Theorem) Suppose that $f$ is a
real-valued function, defined and continuous on the closed interval $[a,b]$
of $\mathbb{R}$, and $f$ is differentiable in the open interval $(a,b)$. Then, there exists
a number $\xi \in (a,b)$ such that
\[ f(b) - f(a) = f'(\xi)\,(b - a). \]


Theorem A.4 (Taylor’s Theorem) Suppose that $n$ is a nonnegative
integer, and $f$ is a real-valued function, defined and continuous on the
closed interval $[a,b]$ of $\mathbb{R}$, such that the derivatives of $f$ of order up to
and including $n$ are defined and continuous on the closed interval $[a,b]$.
Suppose further that $f^{(n)}$ is differentiable on the open interval $(a,b)$.
Then, for each value of $x$ in $[a,b]$, there exists a number $\xi = \xi(x)$ in the
open interval $(a,b)$ such that
\[ f(x) = f(a) + (x-a)f'(a) + \cdots + \frac{(x-a)^n}{n!} f^{(n)}(a) + \frac{(x-a)^{n+1}}{(n+1)!} f^{(n+1)}(\xi). \]
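
Theorem A.4 can be illustrated on a function whose derivatives are all
known. The sketch below uses $f(x) = \mathrm{e}^x$ on $[a,b] = [0,1]$, an
assumption made purely for illustration, so that every derivative is
again $\mathrm{e}^x$ and is bounded by $\mathrm{e}^b$ on the interval. It
compares the actual remainder with the bound implied by the theorem and
recovers the intermediate point $\xi$ explicitly; the helper name
`taylor_poly_exp` is hypothetical.

```python
import math

def taylor_poly_exp(x, a, n):
    """Degree-n Taylor polynomial of exp about a, evaluated at x."""
    return sum(math.exp(a) * (x - a) ** j / math.factorial(j) for j in range(n + 1))

a, b, x = 0.0, 1.0, 0.8
results = []
for n in range(6):
    remainder = math.exp(x) - taylor_poly_exp(x, a, n)
    factor = (x - a) ** (n + 1) / math.factorial(n + 1)
    # |f^(n+1)| <= e^b on [a, b], so the remainder is at most e^b * factor.
    bound = math.exp(b) * factor
    # For f = exp the remainder equals factor * exp(xi); solve for xi.
    xi = math.log(remainder / factor)
    results.append((n, remainder, bound, xi))
    print(f"n = {n}: remainder = {remainder:.3e}, bound = {bound:.3e}, xi = {xi:.4f}")
```

In each row the recovered $\xi$ should lie strictly between $a$ and $b$,
as the theorem asserts.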


Theorem A.5 (Taylor’s Theorem with integral remainder) Let
$n$ be a nonnegative integer and suppose that $f$ is a real-valued function,
defined and continuous on the closed interval $[a,b]$ of $\mathbb{R}$, such that the
derivatives of $f$ of order up to and including $n$ are defined and continuous
on $[a,b]$, $f^{(n)}$ is differentiable on the open interval $(a,b)$, and $f^{(n+1)}$ is
integrable on $(a,b)$. Then, for each $x \in [a,b]$,
\[ f(x) = f(a) + (x-a)f'(a) + \cdots + \frac{(x-a)^n}{n!} f^{(n)}(a) + \int_a^x \frac{(x-t)^n}{n!} f^{(n+1)}(t)\,\mathrm{d}t. \]


Proof As this version of the theorem may be rather less familiar we
include a proof.

The theorem is trivially true for $n = 0$. Suppose that the theorem is
true for some nonnegative integer, say $n = k$. Then, provided that $f^{(k+1)}$
is differentiable on $(a,b)$ and $f^{(k+2)}$ is integrable on $(a,b)$, integration
by parts shows that
\[ \int_a^x \frac{(x-t)^k}{k!} f^{(k+1)}(t)\,\mathrm{d}t = \frac{(x-a)^{k+1}}{(k+1)!} f^{(k+1)}(a) + \int_a^x \frac{(x-t)^{k+1}}{(k+1)!} f^{(k+2)}(t)\,\mathrm{d}t; \]
use of the theorem when $n = k$ now shows that it is also true for $n = k+1$.
The proof by induction is then complete.


Theorem A.6 (The Integral Mean Value Theorem) Suppose that
$f$ is a real-valued function, defined and continuous on a closed interval
$[a,b]$ of $\mathbb{R}$, and let $g$ be a function, defined, nonnegative and integrable
on $(a,b)$. Then, there exists a number $\xi \in (a,b)$ such that
\[ \int_a^b f(x)g(x)\,\mathrm{d}x = f(\xi) \int_a^b g(x)\,\mathrm{d}x. \]

Proof Since $f$ is continuous on $[a,b]$, it is bounded on $[a,b]$, say
\[ m \leq f(x) \leq M, \qquad x \in [a,b]. \]
Then, as $g(x) \geq 0$ for all $x \in (a,b)$, we have that
\[ m\,g(x) \leq f(x)g(x) \leq M\,g(x), \qquad x \in (a,b). \]
Integrating these inequalities gives
\[ m \int_a^b g(x)\,\mathrm{d}x \leq \int_a^b f(x)g(x)\,\mathrm{d}x \leq M \int_a^b g(x)\,\mathrm{d}x. \]
If $\int_a^b g(x)\,\mathrm{d}x = 0$, then the result trivially follows. If, on the other hand,
$\int_a^b g(x)\,\mathrm{d}x > 0$, then
\[ m \leq \frac{\int_a^b f(x)g(x)\,\mathrm{d}x}{\int_a^b g(x)\,\mathrm{d}x} \leq M. \]
The existence of the required value of $\xi \in (a,b)$ now follows from the
Intermediate Value Theorem.


Theorem A.6 obviously also holds provided that $g(x) \leq 0$ on $(a,b)$;
it is only important that $g$ has constant sign on $(a,b)$. Note also that
we do not require that $g$ is continuous, only that it is integrable. For
example, Theorem A.6 will hold if $f$ is a continuous function defined on
$[0,1]$ and $g(x) = x^{-1/2}$, $x \in (0,1)$.
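
To make the closing example concrete, take $f(x) = x$ on $[0,1]$ together
with the weight $g(x) = x^{-1/2}$; the choice of $f$ is an assumption made
here purely for illustration. Both integrals can be evaluated by hand,
and the point $\xi$ can be identified explicitly:

```latex
\[
  \int_0^1 f(x)\,g(x)\,\mathrm{d}x = \int_0^1 x^{1/2}\,\mathrm{d}x = \tfrac{2}{3},
  \qquad
  \int_0^1 g(x)\,\mathrm{d}x = \int_0^1 x^{-1/2}\,\mathrm{d}x = 2,
\]
\[
  f(\xi) = \frac{2/3}{2} = \tfrac{1}{3}
  \quad\Longrightarrow\quad
  \xi = \tfrac{1}{3} \in (0,1).
\]
```

Although $g$ is unbounded near $x = 0$, it is integrable on $(0,1)$, which
is all that the theorem requires.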
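
Theorem A.7 (Taylor’s Theorem for several variables) Suppose
that $f$ is a real-valued function of $n$ real variables, $n \geq 1$, such that $f$ and
all of its partial derivatives up to and including order $k+1$ are defined,
continuous and bounded in a neighbourhood of the point $\mathbf{a}$ in $\mathbb{R}^n$. Let
$A$ denote an upper bound on the absolute values of all the derivatives of
order $k+1$ in this neighbourhood. Then
\[ f(\mathbf{a} + \boldsymbol{\eta}) = \sum_{r=0}^{k} \frac{1}{r!} \left[ \left( \eta_1 \frac{\partial}{\partial x_1} + \cdots + \eta_n \frac{\partial}{\partial x_n} \right)^{r} f \right](\mathbf{a}) + E_k, \]
where
\[ |E_k| \leq \frac{1}{(k+1)!}\, A\, n^{k+1}\, \|\boldsymbol{\eta}\|_\infty^{k+1}. \]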
Proof The proof involves the application of Theorem A.4, Taylor’s
Theorem, to the function of one variable
\[ \varphi(t) = f(\mathbf{a} + t\boldsymbol{\eta}) \]
to give a series expansion for $\varphi(1)$. Then, the expressions for the
derivatives of $\varphi$ in terms of the partial derivatives of $f$, via the chain rule,
yield the required result; $n^{k+1}$ is the number of partial derivatives of
order $k+1$ for a function of $n$ variables. $\Box$


WWW-resources


The book would not be complete without some mention of numerical 
analysis software and software repositories on the World Wide Web. 

An excellent source of mathematical software is the Netlib Repository 
on the website 


http://www.netlib.org 


A detailed classified list of the available mathematical software libraries 
can be viewed by clicking on the Browse button on this webpage. It is 
also possible to search the repository for a specific piece of software. 

Another useful resource is the website of the ACM Transactions on 
Mathematical Software (TOMS) at 


http://math.nist.gov/toms/ 


The site maintains a well-organised repository, including a range of freely 
available packages for both numerical and symbolical computations, as 
well as a number of helpful links to various software vendors. The latter 
include the developers of Maple (a software for symbolical and numerical 
computations, scientific visualisation and programming), the makers of 
Mathematica (a software system for symbolical, numerical and graphical 
computations), the Numerical Algorithms Group (NAG), MathWorks, 
Inc., the developers of Matlab (a technical computing environment for 
high-performance numerical computation and visualisation), and many 
others. Most of the numerical experiments included in the book were 
performed by using either Matlab or Maple. 

Concerning the history of mathematics, we refer to the MacTutor
history of mathematics website at St Andrews University in Scotland:


http://www-history.mcs.st-andrews.ac.uk/history/ 


A more recent site, dedicated specifically to the history of approximation 
theory, resides on 


http://www.cs.wisc.edu/~deboor/HAT/